Repository: aws/aws-neuron-sdk Branch: master Commit: 371eabc8a739 Files: 1636 Total size: 10.3 MB Directory structure: gitextract_u554eb9v/ ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── bug-report.yml │ │ ├── config.yml │ │ ├── documentation.yml │ │ └── feature-request.yml │ ├── pull_request_template.md │ ├── stale_issue_mark_close_workflow.yml │ └── workflows/ │ ├── acknowledge-new-issue.yml │ └── auto-label-issues.yml ├── .gitignore ├── .readthedocs.yml ├── CODEOWNERS ├── CONTRIBUTING.md ├── Dockerfile ├── LICENSE-DOCUMENTATION ├── LICENSE-SAMPLECODE ├── LICENSE-SUMMARY-DOCS-SAMPLES ├── Makefile ├── README.md ├── _backup-setup/ │ └── neuron-setup/ │ ├── multiframework/ │ │ ├── multi-framework-ubuntu22-neuron-dlami.rst │ │ └── multi-framework-ubuntu24-neuron-dlami.rst │ └── pytorch/ │ ├── neuron/ │ │ ├── amazon-linux/ │ │ │ ├── torch-neuron-al2-base-dlami.rst │ │ │ ├── torch-neuron-al2-pytorch-dlami.rst │ │ │ ├── torch-neuron-al2.rst │ │ │ └── torch-neuron-al2023.rst │ │ └── ubuntu/ │ │ ├── torch-neuron-ubuntu20-base-dlami.rst │ │ ├── torch-neuron-ubuntu20-pytorch-dlami.rst │ │ ├── torch-neuron-ubuntu20.rst │ │ └── torch-neuron-ubuntu22.rst │ └── neuronx/ │ ├── amazon-linux/ │ │ ├── torch-neuronx-al2-base-dlami.rst │ │ ├── torch-neuronx-al2-pytorch-dlami.rst │ │ ├── torch-neuronx-al2.rst │ │ └── torch-neuronx-al2023.rst │ └── ubuntu/ │ ├── torch-neuronx-ubuntu20-base-dlami.rst │ ├── torch-neuronx-ubuntu20-pytorch-dlami.rst │ ├── torch-neuronx-ubuntu20.rst │ ├── torch-neuronx-ubuntu22.rst │ └── torch-neuronx-ubuntu24.rst ├── _content-types/ │ ├── conceptual-deep-dive.rst │ ├── model-card.rst │ ├── procedural-how-to.rst │ ├── procedural-tutorial.ipynb │ ├── reference-kernel-api.rst │ └── release-notes-templates/ │ ├── compiler.rst │ ├── containers.rst │ ├── dlami.rst │ ├── index.rst │ ├── nki.rst │ ├── nx-jax.rst │ ├── nx-pytorch.rst │ ├── nxd-core.rst │ ├── nxd-inference.rst │ ├── nxd-training.rst │ ├── runtime.rst │ └── tools.rst ├── _ext/ │ ├── archive.py │ ├── df_tables.py │ ├── local_documenter.py │ ├── neuron_tag.py │ ├── release-notes-automation-spec.md │ ├── release-notes-context.md │ ├── sphinx_plotly_directive.py │ └── symlink.py ├── _static/ │ └── css/ │ ├── custom.css │ └── custom.css.new ├── _templates/ │ ├── recentposts.html │ ├── search-field.html │ ├── search-google.html │ └── search.html ├── _utilities/ │ ├── JIRA_SETUP_QUICKSTART.md │ ├── add_meta.py │ ├── audit_frameworks.py │ ├── check_urls.sh │ ├── create_sitemap.py │ ├── format_build_logs.py │ ├── inject_archive_meta.py │ ├── metadata_schema.yaml │ ├── migrate_setup_content.py │ ├── old-nki-apis.txt │ └── setup_jira_token.sh ├── about-neuron/ │ ├── amazonq-getstarted.rst │ ├── announcements/ │ │ ├── index.rst │ │ ├── neuron1.x/ │ │ │ ├── announce-eol-mx-before-1-5.rst │ │ │ ├── announce-eol-pt-1-5.rst │ │ │ ├── announce-eol-pt-before-1-8.rst │ │ │ ├── announce-eol-tf-before-2-5.rst │ │ │ ├── announce-eol-tf-before-2-7.rst │ │ │ ├── announcements.rst │ │ │ ├── eol-ncgs-env_2.rst │ │ │ ├── eol-pt-15.rst │ │ │ └── eol-tf-21-24.rst │ │ └── neuron2.x/ │ │ ├── announce-component-change.rst │ │ ├── announce-correction-neuron-driver-support-inf1.rst │ │ ├── announce-deprecation-containers-rtd.rst │ │ ├── announce-deprecation-nxd-path-trace-api.rst │ │ ├── announce-deprecation-transformer-flag.rst │ │ ├── announce-eol-megatron-lm.rst │ │ ├── announce-eol-python-3-7.rst │ │ ├── announce-eol-ubuntu-18.rst │ │ ├── announce-eos-al2.rst │ │ ├── announce-eos-beta-pytorch-neuroncore-placement-apis.rst │ │ ├── announce-eos-bf16-vars.rst 
│ │ ├── announce-eos-block-dimension-nki.rst │ │ ├── announce-eos-dlami-ubuntu-22-04.rst │ │ ├── announce-eos-dlami.rst │ │ ├── announce-eos-inf1-virtual-environments.rst │ │ ├── announce-eos-jax-neuronx-nki-call.rst │ │ ├── announce-eos-megatronlm-2-13.rst │ │ ├── announce-eos-mllama-checkpoint.rst │ │ ├── announce-eos-multiframework-dlamis-inf1.rst │ │ ├── announce-eos-nemo.rst │ │ ├── announce-eos-neuron-det.rst │ │ ├── announce-eos-neuron-driver-support-inf1.rst │ │ ├── announce-eos-neuron-profiler-2.rst │ │ ├── announce-eos-neuron-profiler-v230.rst │ │ ├── announce-eos-neuron-profiler.rst │ │ ├── announce-eos-neurondevice-version.rst │ │ ├── announce-eos-neurondevice.rst │ │ ├── announce-eos-nxd-examples.rst │ │ ├── announce-eos-nxdt-nxd-core-training.rst │ │ ├── announce-eos-probuf.rst │ │ ├── announce-eos-pt-versions.rst │ │ ├── announce-eos-pt2.rst │ │ ├── announce-eos-python38.rst │ │ ├── announce-eos-pytorch-1-1-3.rst │ │ ├── announce-eos-pytorch-1-9.rst │ │ ├── announce-eos-pytorch-2-1.rst │ │ ├── announce-eos-pytorch-2-7-2-8-v229.rst │ │ ├── announce-eos-pytorch-2-7-2-8.rst │ │ ├── announce-eos-pytorch-profiling-api.rst │ │ ├── announce-eos-tensorboard-tools.rst │ │ ├── announce-eos-tensorflow-2-8-9.rst │ │ ├── announce-eos-tensorflow-inf2.rst │ │ ├── announce-eos-tensorflow1-x.rst │ │ ├── announce-eos-torch-neuron.rst │ │ ├── announce-eos-torch-neuronx-nki-jit.rst │ │ ├── announce-eos-u20-dlamis.rst │ │ ├── announce-eos-xla-bf16.rst │ │ ├── announce-intent-eol-nemo-arg.rst │ │ ├── announce-intent-eos-opt.rst │ │ ├── announce-intent-eos-pt-version.rst │ │ ├── announce-intent-eos-pt2-6.rst │ │ ├── announce-intent-eos-tensorflow-tutorial-inf.rst │ │ ├── announce-intent-eos-tnx.rst │ │ ├── announce-intent-maintenance-tnx.rst │ │ ├── announce-maintenance-mxnet.rst │ │ ├── announce-maintenance-nxdi-nxd-core-inference.rst │ │ ├── announce-maintenance-nxdt-nxd-core-training.rst │ │ ├── announce-maintenance-tf.rst │ │ ├── announce-moving-samples.rst │ │ ├── announce-nki-library-namespace-changes-2-28.rst │ │ ├── announce-nki-namespace-migration.rst │ │ ├── announce-no-longer-support-neuron-det.rst │ │ ├── announce-no-longer-support-nxd-examples.rst │ │ ├── announce-no-longer-support-pytorch-113.rst │ │ ├── announce-no-longer-support-pytorch-2-1.rst │ │ ├── announce-no-longer-support-pytorch-2-7-2-8.rst │ │ ├── announce-no-longer-support-tensorflow-inf2.rst │ │ ├── announce-no-longer-support-u20-dlc-dlami.rst │ │ ├── announce-no-support-al2.rst │ │ ├── announce-no-support-device-version.rst │ │ ├── announce-no-support-jax-neuronx-nki-call.rst │ │ ├── announce-no-support-llama3-2-checkpoint.rst │ │ ├── announce-no-support-nemo-megatron.rst │ │ ├── announce-no-support-neurondevice.rst │ │ ├── announce-no-support-nki-jit-torch.rst │ │ ├── announce-no-support-tensorboard-plugin.rst │ │ ├── announce-no-support-tensorflow1-x.rst │ │ ├── announce-no-support-tensorflow2-10.rst │ │ ├── announce-no-support-tf-versions.rst │ │ ├── announce-no-support-torch-neuron-versions.rst │ │ ├── announce-no-support-ubuntu-20-base.rst │ │ ├── announce-no-support-vllm-v0.rst │ │ ├── announce-nxdi-changes.rst │ │ ├── announce-package-change.rst │ │ ├── announce-python38-no-longer-support.rst │ │ ├── announce-transition-pytorch-trainium.rst │ │ ├── announcement-end-of-support-neuronxcc-nki.rst │ │ ├── announcement-end-of-support-nxdt-nxd-core.rst │ │ ├── announcement-end-of-support-parallel-model-trace.rst │ │ ├── announcement-end-of-support-pytorch-2-6.rst │ │ ├── announcement-end-of-support-vllm-v0.rst │ │ ├── 
announcement-nki-library-kernel-migration.rst │ │ ├── announcement-nki-library-namespace-changes.rst │ │ ├── announcement-python-3-9-eol.rst │ │ ├── dlami-neuron-2.10.rst │ │ ├── dlami-neuron-2.12.rst │ │ ├── dlami-pytorch-introduce.rst │ │ ├── end-of-support-pt2.rst │ │ ├── github-changes.rst │ │ ├── gpg-expiration.rst │ │ ├── neuron-rtd-eol.rst │ │ ├── neuron2-intro.rst │ │ ├── neuron230-packages-changes.rst │ │ ├── neuron250-packages-changes.rst │ │ ├── release-neuron2.4.rst │ │ ├── sm-training-dlc-2.9.1.rst │ │ └── sm-training-trn1-introduce.rst │ ├── appnotes/ │ │ ├── index.rst │ │ ├── mxnet-neuron/ │ │ │ └── flex-eg.rst │ │ ├── neuron-cc/ │ │ │ └── mixed-precision.rst │ │ ├── neuron1x/ │ │ │ ├── important-neuronx-dkms.txt │ │ │ └── introducing-libnrt.rst │ │ ├── neuronx-cc/ │ │ │ └── neuronx-cc-training-mixed-precision.rst │ │ ├── neuronx-distributed/ │ │ │ ├── introducing-nxd-inference.rst │ │ │ └── introducing-nxdt-training.rst │ │ ├── perf/ │ │ │ └── neuron-cc/ │ │ │ ├── parallel-ncgs.rst │ │ │ └── performance-tuning.rst │ │ ├── torch-neuron/ │ │ │ ├── bucketing-app-note.rst │ │ │ ├── index.rst │ │ │ ├── rcnn-app-note.rst │ │ │ └── torch-neuron-dataparallel-app-note.rst │ │ ├── torch-neuronx/ │ │ │ ├── index.rst │ │ │ ├── introducing-pytorch-2-6.rst │ │ │ ├── introducing-pytorch-2-7.rst │ │ │ ├── introducing-pytorch-2-8.rst │ │ │ ├── introducing-pytorch-2-9.rst │ │ │ ├── introducing-pytorch-2-x.rst │ │ │ ├── migration-from-xla-downcast-bf16.rst │ │ │ ├── torch-neuronx-dataparallel-app-note.rst │ │ │ └── torch-neuronx-graph-partitioner-app-note.rst │ │ └── transformers-neuronx/ │ │ └── generative-llm-inference-with-neuron.rst │ ├── arch/ │ │ ├── glossary.rst │ │ ├── index.rst │ │ ├── neuron-features/ │ │ │ ├── custom-c++-operators.rst │ │ │ ├── data-types.rst │ │ │ ├── index.rst │ │ │ ├── logical-neuroncore-config.rst │ │ │ ├── neuron-caching.rst │ │ │ ├── neuroncore-batching.rst │ │ │ ├── neuroncore-pipeline.rst │ │ │ └── rounding-modes.rst │ │ └── neuron-hardware/ │ │ ├── inf1-arch.rst │ │ ├── inf2-arch.rst │ │ ├── inferentia.rst │ │ ├── inferentia2.rst │ │ ├── neuron-core-v1.rst │ │ ├── neuron-core-v2.rst │ │ ├── neuron-core-v3.rst │ │ ├── neuron-core-v4.rst │ │ ├── trainium.rst │ │ ├── trainium2.rst │ │ ├── trainium3.rst │ │ ├── trn1-arch.rst │ │ ├── trn2-arch.rst │ │ └── trn3-arch.rst │ ├── benchmarks/ │ │ ├── index.rst │ │ ├── inf1/ │ │ │ ├── data.csv │ │ │ ├── index.rst │ │ │ ├── instance_prices.csv │ │ │ ├── latency_data_encoder.csv │ │ │ ├── throughput_data_cnn.csv │ │ │ └── throughput_data_encoder.csv │ │ ├── inf2/ │ │ │ ├── inf2-performance.rst │ │ │ ├── inf2_instance_prices.csv │ │ │ ├── latency_data_decoder.csv │ │ │ ├── latency_data_encoder.csv │ │ │ ├── latency_data_encoder_decoder.csv │ │ │ ├── latency_data_vision.csv │ │ │ ├── latency_data_vision_cnn.csv │ │ │ ├── latency_data_vision_dit.csv │ │ │ ├── latency_data_vision_sd.csv │ │ │ ├── latency_data_vision_transformers.csv │ │ │ ├── throughput_data_decoder.csv │ │ │ ├── throughput_data_encoder.csv │ │ │ ├── throughput_data_encoder_decoder.csv │ │ │ ├── throughput_data_vision.csv │ │ │ ├── throughput_data_vision_cnn.csv │ │ │ ├── throughput_data_vision_dit.csv │ │ │ ├── throughput_data_vision_sd.csv │ │ │ └── throughput_data_vision_transformers.csv │ │ └── trn1/ │ │ ├── latency_data_decoder.csv │ │ ├── latency_data_encoder.csv │ │ ├── latency_data_encoder_decoder.csv │ │ ├── throughput_data_decoder.csv │ │ ├── throughput_data_encoder.csv │ │ ├── throughput_data_encoder_decoder.csv │ │ ├── training_data_decoder.csv 
│ │ ├── training_data_encoder.csv │ │ ├── training_data_vision_transformers.csv │ │ ├── trn1-inference-performance.rst │ │ ├── trn1-training-performance.rst │ │ ├── trn1_instance_prices.csv │ │ └── trn1_trn1n_nlp_data.csv │ ├── beta-participation.rst │ ├── calculator/ │ │ └── neuron-calculator.rst │ ├── faq/ │ │ ├── contributing-faq.rst │ │ ├── index.rst │ │ ├── inference/ │ │ │ ├── neuron-faq.rst │ │ │ └── trouble-shooting-faq.rst │ │ ├── neuron2-intro-faq.rst │ │ ├── onnx-faq.rst │ │ ├── roadmap-faq.rst │ │ └── training/ │ │ └── neuron-training.rst │ ├── faq.rst │ ├── index.rst │ ├── models/ │ │ ├── index.rst │ │ ├── inference-inf1-samples.rst │ │ ├── inference-inf2-trn1-samples.rst │ │ └── training-trn1-samples.rst │ ├── monitoring-tools.rst │ ├── news-and-blogs/ │ │ ├── CONTRIBUTING.md │ │ ├── JIRA-INTEGRATION-DESIGN.md │ │ ├── README.md │ │ ├── article-template.yaml │ │ ├── index.rst │ │ ├── news-and-blogs.yaml │ │ └── validate_articles.py │ ├── oss/ │ │ └── index.rst │ ├── profiling-tools.rst │ ├── quick-start/ │ │ ├── _specs/ │ │ │ └── REFACTORING_NOTES.md │ │ ├── docs-quicklinks.rst │ │ ├── github-samples.rst │ │ ├── index.rst │ │ ├── inference-quickstart.rst │ │ ├── mxnet-neuron.rst │ │ ├── tab-inference-tensorflow-neuron.rst │ │ ├── tensorflow-neuron.rst │ │ ├── torch-neuron-tab-training.rst │ │ ├── torch-neuron.rst │ │ ├── training-quickstart.rst │ │ └── user-guide-quickstart.rst │ ├── sdk-policy.rst │ ├── security.rst │ ├── troubleshooting.rst │ ├── what-is-neuron.rst │ └── whats-new.rst ├── archive/ │ ├── helper-tools/ │ │ ├── index.rst │ │ ├── tutorial-neuron-check-model.rst │ │ └── tutorial-neuron-gatherinfo.rst │ ├── index.rst │ ├── mxnet-neuron/ │ │ ├── api-compilation-python-api.rst │ │ ├── api-reference-guide.rst │ │ ├── api-reference-guide.txt │ │ ├── developer-guide.rst │ │ ├── developer-guide.txt │ │ ├── ec2-then-ec2-devflow.rst │ │ ├── index.rst │ │ ├── inference-mxnet-neuron.rst │ │ ├── inference-mxnet-neuron.txt │ │ ├── misc-mxnet-neuron.rst │ │ ├── misc-mxnet-neuron.txt │ │ ├── mxnet-neuron-setup.rst │ │ ├── mxnet-neuron-setup.txt │ │ ├── neo-then-hosting-devflow.rst │ │ ├── setup/ │ │ │ ├── mxnet-install-prev-al2.rst │ │ │ ├── mxnet-install-prev-al2023.rst │ │ │ ├── mxnet-install-prev-u20.rst │ │ │ ├── mxnet-install-prev-u22.rst │ │ │ ├── mxnet-install.rst │ │ │ ├── mxnet-neuron-al2-base-dlami.rst │ │ │ ├── mxnet-neuron-al2.rst │ │ │ ├── mxnet-neuron-al2023.rst │ │ │ ├── mxnet-neuron-ubuntu20-base-dlami.rst │ │ │ ├── mxnet-neuron-ubuntu20.rst │ │ │ ├── mxnet-neuron-ubuntu22.rst │ │ │ ├── mxnet-update-u20.rst │ │ │ ├── mxnet-update.rst │ │ │ ├── prev-releases/ │ │ │ │ ├── neuron-1.14.2-mxnet-install.rst │ │ │ │ ├── neuron-1.15.0-mxnet-install.rst │ │ │ │ ├── neuron-1.15.1-mxnet-install.rst │ │ │ │ ├── neuron-1.15.2-mxnet-install.rst │ │ │ │ ├── neuron-1.16.3-mxnet-install.rst │ │ │ │ ├── neuron-1.17.2-mxnet-install.rst │ │ │ │ ├── neuron-1.18.0-mxnet-install.rst │ │ │ │ └── neuron-1.19.0-mxnet-install.rst │ │ │ └── setup-inference │ │ ├── troubleshooting-guide.rst │ │ └── tutorials/ │ │ ├── mxnet-tutorial-setup.rst │ │ ├── tutorial-model-serving.rst │ │ ├── tutorials-mxnet-computervision.rst │ │ ├── tutorials-mxnet-neuron.rst │ │ ├── tutorials-mxnet-neuron.txt │ │ ├── tutorials-mxnet-nlp.rst │ │ └── tutorials-mxnet-utilizing-neuron-capabilities.rst │ ├── neuronperf/ │ │ ├── index.rst │ │ ├── neuronperf_api.rst │ │ ├── neuronperf_benchmark_guide.rst │ │ ├── neuronperf_compile_guide.rst │ │ ├── neuronperf_evaluate_guide.rst │ │ ├── neuronperf_examples.rst │ │ ├── 
neuronperf_faq.rst │ │ ├── neuronperf_framework_notes.rst │ │ ├── neuronperf_install.rst │ │ ├── neuronperf_model_index_guide.rst │ │ ├── neuronperf_overview.rst │ │ ├── neuronperf_terminology.rst │ │ ├── neuronperf_troubleshooting.rst │ │ ├── rn.rst │ │ ├── setup.cfg │ │ ├── setup.py │ │ ├── test_resnet50_pt.py │ │ └── test_simple_pt.py │ ├── src/ │ │ └── benchmark/ │ │ └── pytorch/ │ │ ├── bert-base-cased_benchmark.py │ │ ├── bert-base-cased_compile.py │ │ ├── bert-base-uncased_benchmark.py │ │ ├── bert-base-uncased_compile.py │ │ ├── distilbert-base-uncased-finetuned-sst-2-english_benchmark.py │ │ ├── distilbert-base-uncased-finetuned-sst-2-english_compile.py │ │ ├── distilbert-base-uncased_benchmark.py │ │ ├── distilbert-base-uncased_compile.py │ │ ├── distilroberta-base_benchmark.py │ │ ├── distilroberta-base_compile.py │ │ ├── hf-google-vit_benchmark.py │ │ ├── hf-openai-clip_benchmark.py │ │ ├── hf_pretrained_wav2vec2_conformer_relpos_benchmark.py │ │ ├── hf_pretrained_wav2vec2_conformer_rope_benchmark.py │ │ ├── inf2_benchmark.py │ │ ├── opt_benchmark.py │ │ ├── perceiver-multimodal_benchmark.py │ │ ├── perceiver-multimodal_compile.py │ │ ├── perceiver-vision_benchmark.py │ │ ├── perceiver-vision_compile.py │ │ ├── pixart_alpha_benchmark.py │ │ ├── pixart_sigma_benchmark.py │ │ ├── resnet50_benchmark.py │ │ ├── resnet50_compile.py │ │ ├── resnet_benchmark.py │ │ ├── resnet_compile.py │ │ ├── sd2_512_benchmark.py │ │ ├── sd2_512_compile.py │ │ ├── sd2_768_benchmark.py │ │ ├── sd2_768_compile.py │ │ ├── sd2_inpainting_benchmark.py │ │ ├── sd2_inpainting_inference.py │ │ ├── sd_15_512_benchmark.py │ │ ├── sd_15_512_compile.py │ │ ├── sd_4x_upscaler_benchmark.py │ │ ├── sd_4x_upscaler_compile.py │ │ ├── sdxl_base_1024_benchmark.py │ │ ├── sdxl_base_1024_compile.py │ │ ├── sdxl_base_and_refiner_1024_benchmark.py │ │ ├── sdxl_base_and_refiner_1024_compile.py │ │ ├── unet_benchmark.py │ │ ├── unet_compile.py │ │ ├── vgg_benchmark.py │ │ └── vgg_compile.py │ ├── tensorboard/ │ │ └── getting-started-tensorboard-neuron-plugin.rst │ ├── tensorflow/ │ │ ├── index.rst │ │ ├── setup-legacy-inf1-tensorflow.rst │ │ ├── tensorflow-neuron/ │ │ │ ├── additional-examples.rst │ │ │ ├── additional-examples.txt │ │ │ ├── api-auto-replication-api.rst │ │ │ ├── api-compilation-python-api.rst │ │ │ ├── api-reference-guide.rst │ │ │ ├── api-reference-guide.txt │ │ │ ├── api-tfn-analyze-model-api.rst │ │ │ ├── api-tracing-python-api.rst │ │ │ ├── dlc-then-ec2-devflow.rst │ │ │ ├── dlc-then-ecs-devflow.rst │ │ │ ├── dlc-then-eks-devflow.rst │ │ │ ├── ec2-then-ec2-devflow.rst │ │ │ ├── misc-tensorflow-neuron.rst │ │ │ ├── misc-tensorflow-neuron.txt │ │ │ ├── neo-then-hosting-devflow.rst │ │ │ ├── setup/ │ │ │ │ ├── prev-releases/ │ │ │ │ │ ├── neuron-1.14.2-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.15.0-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.15.1-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.15.2-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.16.3-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.17.0-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.17.1-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.17.2-tensorflow-install.rst │ │ │ │ │ ├── neuron-1.18.0-tensorflow-install.rst │ │ │ │ │ └── neuron-1.19.0-tensorflow-install.rst │ │ │ │ ├── tensorflow-install-prev-al2023.rst │ │ │ │ ├── tensorflow-install-prev-u20.rst │ │ │ │ ├── tensorflow-install-prev-u22.rst │ │ │ │ ├── tensorflow-install-prev.rst │ │ │ │ ├── tensorflow-install.rst │ │ │ │ ├── tensorflow-update-u20.rst │ │ │ │ ├── tensorflow-update-u22.rst 
│ │ │ │ └── tensorflow-update.rst │ │ │ ├── tensorflow2-accelerated-ops.rst │ │ │ ├── tf2_faq.rst │ │ │ └── tutorials/ │ │ │ ├── bert_demo/ │ │ │ │ ├── bert_demo.rst │ │ │ │ ├── glue_mrpc_dev.tsv │ │ │ │ └── mrpc.proto │ │ │ ├── index.rst │ │ │ ├── k8s_bert_demo/ │ │ │ │ └── Dockerfile.tfserving_example │ │ │ ├── tensorflow-tutorial-setup.rst │ │ │ ├── tutorials-tensorflow-neuron.rst │ │ │ ├── tutorials-tensorflow-neuron.txt │ │ │ ├── tutorials-tensorflow-nlp.rst │ │ │ └── tutorials-tensorflow-utilizing-neuron-capabilities.rst │ │ ├── tensorflow-neuron-inference.rst │ │ ├── tensorflow-neuron-inference.txt │ │ ├── tensorflow-neuronx/ │ │ │ ├── api-reference-guide.rst │ │ │ ├── api-reference-guide.txt │ │ │ ├── misc-tensorflow-neuronx.rst │ │ │ ├── misc-tensorflow-neuronx.txt │ │ │ ├── setup/ │ │ │ │ ├── index.rst │ │ │ │ ├── prev-releases/ │ │ │ │ │ ├── neuronx-2.8.0-tensorflow-install.rst │ │ │ │ │ └── neuronx-2.9.0-tensorflow-install.rst │ │ │ │ ├── tensorflow-install-prev-al2.rst │ │ │ │ ├── tensorflow-install-prev-al2023.rst │ │ │ │ ├── tensorflow-install-prev-u20.rst │ │ │ │ ├── tensorflow-install-prev-u22.rst │ │ │ │ ├── tensorflow-neuronx-install.rst │ │ │ │ ├── tensorflow-update-al2-dlami.rst │ │ │ │ ├── tensorflow-update-al2.rst │ │ │ │ ├── tensorflow-update-u20-dlami.rst │ │ │ │ ├── tensorflow-update-u20.rst │ │ │ │ └── tensorflow-update-u22.rst │ │ │ ├── tf-neuronx-auto-replication-api.rst │ │ │ ├── tfneuronx-python-tracing-api.rst │ │ │ ├── tfnx-analyze-model-api.rst │ │ │ └── tutorials/ │ │ │ ├── tutorial-tensorflowx-serving-NeuronRT-Visible-Cores.rst │ │ │ ├── tutorials-tensorflow-neuronx.rst │ │ │ └── tutorials-tensorflow-neuronx.txt │ │ ├── tensorflow-neuronx-inference.rst │ │ ├── tensorflow-neuronx-inference.txt │ │ ├── tensorflow-setup.rst │ │ └── tensorflow-setup.txt │ ├── torch-neuron/ │ │ ├── additional-examples-inference-torch-neuron.rst │ │ ├── additional-examples-inference-torch-neuron.txt │ │ ├── api-compilation-python-api.rst │ │ ├── api-core-placement.rst │ │ ├── api-reference-guide-torch-neuron.rst │ │ ├── api-reference-guide-torch-neuron.txt │ │ ├── api-torch-neuron-dataparallel-api.rst │ │ ├── developer-guide-torch-neuron.rst │ │ ├── developer-guide-torch-neuron.txt │ │ ├── guides/ │ │ │ ├── core-placement/ │ │ │ │ └── torch-core-placement.rst │ │ │ └── torch-lstm-support.rst │ │ ├── index.rst │ │ ├── inference-torch-neuron.rst │ │ ├── misc-inference-torch-neuron.rst │ │ ├── misc-inference-torch-neuron.txt │ │ ├── placement.py │ │ ├── setup/ │ │ │ ├── index.rst │ │ │ ├── prev-releases/ │ │ │ │ ├── neuron-1.14.2-pytorch-install.rst │ │ │ │ ├── neuron-1.15.0-pytorch-install.rst │ │ │ │ ├── neuron-1.15.1-pytorch-install.rst │ │ │ │ ├── neuron-1.15.2-pytorch-install.rst │ │ │ │ ├── neuron-1.16.1-pytorch-install.rst │ │ │ │ ├── neuron-1.16.2-pytorch-install.rst │ │ │ │ ├── neuron-1.16.3-pytorch-install.rst │ │ │ │ ├── neuron-1.17.2-pytorch-install.rst │ │ │ │ ├── neuron-1.18.0-pytorch-install.rst │ │ │ │ ├── neuron-1.19.0-pytorch-install.rst │ │ │ │ ├── neuron-2.3.0-pytorch-install.rst │ │ │ │ ├── neuron-2.4.0-pytorch-install.rst │ │ │ │ └── neuron-2.5.0-pytorch-install.rst │ │ │ ├── pytorch-install-cxx11.rst │ │ │ ├── pytorch-install-prev-al2.rst │ │ │ ├── pytorch-install-prev-al2023.rst │ │ │ ├── pytorch-install-prev-u20.rst │ │ │ ├── pytorch-install-prev-u22.rst │ │ │ ├── pytorch-install-prev.rst │ │ │ ├── pytorch-install.rst │ │ │ ├── pytorch-update-al2-dlami.rst │ │ │ ├── pytorch-update-al2023.rst │ │ │ ├── pytorch-update-u20-dlami.rst │ │ │ ├── 
pytorch-update-u20.rst │ │ │ ├── pytorch-update-u22.rst │ │ │ └── pytorch-update.rst │ │ ├── torch-neuron-dataparallel-example-default.rst │ │ ├── torch-neuron-dataparallel-example-dim-neq-zero.rst │ │ ├── torch-neuron-dataparallel-example-disable-dynamic-batching.rst │ │ ├── torch-neuron-dataparallel-example-dynamic-batching.rst │ │ ├── torch-neuron-dataparallel-example-specify-ncs.rst │ │ ├── troubleshooting-guide.rst │ │ └── tutorials/ │ │ ├── neuroncore_pipeline_pytorch.rst │ │ ├── pytorch-tutorial-setup.rst │ │ ├── transformers-marianmt.rst │ │ ├── tutorial-libtorch.rst │ │ ├── tutorial-torchserve.rst │ │ ├── tutorial_source_instructions/ │ │ │ ├── run_libtorch.sh │ │ │ └── run_torchserve_u20.sh │ │ ├── tutorials-inference-torch-neuron.rst │ │ ├── tutorials-inference-torch-neuron.txt │ │ ├── tutorials-torch-neuron-computervision.rst │ │ ├── tutorials-torch-neuron-nlp.rst │ │ └── tutorials-utilizing-neuron-capabilities.rst │ ├── transformers-neuronx/ │ │ ├── api-reference-guide.rst │ │ ├── api-reference-guide.txt │ │ ├── developer-guide.rst │ │ ├── developer-guide.txt │ │ ├── index.rst │ │ ├── setup/ │ │ │ └── index.rst │ │ ├── transformers-neuronx-api-reference.rst │ │ ├── transformers-neuronx-developer-guide-for-continuous-batching.rst │ │ ├── transformers-neuronx-developer-guide.rst │ │ ├── transformers-neuronx-misc.rst │ │ ├── transformers-neuronx-misc.txt │ │ ├── transformers-neuronx-tutorials.rst │ │ ├── transformers-neuronx-tutorials.txt │ │ └── transformers-neuronx.txt │ └── tutorials/ │ ├── finetune_t5.rst │ ├── finetuning_llama2_7b_ptl.rst │ ├── gpt3_neuronx_nemo_megatron_pretraining.rst │ ├── megatron_gpt_pretraining.rst │ ├── multinode-training-model-profiling.rst │ ├── nxd-source-code/ │ │ ├── gpt_neox_tp_zero1/ │ │ │ ├── gpt_neox_20b.sh │ │ │ └── gpt_neox_6_9b.sh │ │ └── llama_tp_pp_ptl/ │ │ ├── llama_2_13b.sh │ │ ├── llama_2_70b.sh │ │ ├── llama_2_7b.sh │ │ └── llama_tp_pp_ptl_setup.sh │ ├── ssd300_demo/ │ │ ├── requirements.txt │ │ ├── ssd300_demo.rst │ │ ├── ssd300_detection.py │ │ ├── ssd300_evaluation.py │ │ ├── ssd300_evaluation_client.py │ │ └── ssd300_model.py │ ├── training-gpt-neox-20b.rst │ ├── training-gpt-neox.rst │ ├── training_codegen25_7b.rst │ ├── training_llama2_tp_pp_ptl.rst │ └── tutorial_source_code/ │ └── t5_finetuning/ │ ├── t5_finetuning_32_worker_training_code.sh │ ├── t5_finetuning_multi_worker_training_code.sh │ ├── t5_finetuning_setup_code.sh │ ├── t5_finetuning_single_worker_training_code.sh │ └── t5_modify_run_summarization_code.sh ├── audit-report.md ├── build.sh ├── compiler/ │ ├── error-codes/ │ │ ├── EARG001.rst │ │ ├── EBIR023.rst │ │ ├── EBVF030.rst │ │ ├── EHCA005.rst │ │ ├── EOOM001.rst │ │ ├── EOOM002.rst │ │ ├── ESFH002.rst │ │ ├── ESPP004.rst │ │ ├── ESPP047.rst │ │ ├── EUOC002.rst │ │ ├── EVRF001.rst │ │ ├── EVRF004.rst │ │ ├── EVRF005.rst │ │ ├── EVRF006.rst │ │ ├── EVRF007.rst │ │ ├── EVRF009.rst │ │ ├── EVRF010.rst │ │ ├── EVRF011.rst │ │ ├── EVRF013.rst │ │ ├── EVRF015.rst │ │ ├── EVRF016.rst │ │ ├── EVRF017.rst │ │ ├── EVRF018.rst │ │ ├── EVRF019.rst │ │ ├── EVRF022.rst │ │ ├── EVRF031.rst │ │ ├── EXSP001.rst │ │ ├── EXTP004.rst │ │ └── index.rst │ ├── index.rst │ ├── neuron-cc/ │ │ ├── api-reference-guide.rst │ │ ├── command-line-reference.rst │ │ ├── developer-guide.rst │ │ └── faq.rst │ ├── neuron-cc.rst │ ├── neuronx-cc/ │ │ ├── api-reference-guide/ │ │ │ └── index.rst │ │ ├── developer-guide.rst │ │ ├── faq.rst │ │ └── how-to-convolution-in-unet.rst │ └── neuronx-cc.rst ├── conf.py ├── containers/ │ ├── 
container-deployment-flows.rst │ ├── container-sm-hosting-devflow.rst │ ├── developerflows.rst │ ├── developerflows.txt │ ├── dlc-then-customize-devflow.rst │ ├── dlc-then-ec2-devflow.rst │ ├── dlc-then-ecs-devflow.rst │ ├── dlc-then-eks-devflow.rst │ ├── dlc-then-k8s-devflow.rst │ ├── docker-example/ │ │ ├── Dockerfile.device-plugin │ │ ├── index.rst │ │ ├── inference/ │ │ │ ├── Dockerfile-inference │ │ │ ├── Dockerfile-inference-dlc │ │ │ ├── Dockerfile-inference-dlc.rst │ │ │ ├── Dockerfile-libmode │ │ │ ├── Dockerfile-libmode.rst │ │ │ ├── Dockerfile-tf-serving.rst │ │ │ ├── Dockerfile.mxnet-serving │ │ │ ├── Dockerfile.tf-serving │ │ │ ├── config-properties.rst │ │ │ ├── config.properties │ │ │ ├── dockerd-libmode-entrypoint.rst │ │ │ ├── dockerd-libmode-entrypoint.sh │ │ │ ├── torchserve-neuron.rst │ │ │ └── torchserve-neuron.sh │ │ ├── training/ │ │ │ ├── Dockerfile-training-dlc │ │ │ ├── Dockerfile-trainium-dlc.rst │ │ │ ├── mlp.rst │ │ │ ├── mlp_train.py │ │ │ └── model.py │ │ └── v1/ │ │ └── inference/ │ │ ├── Dockerfile-app-rt-diff.rst │ │ ├── Dockerfile-app-rt-same.rst │ │ ├── Dockerfile-neuron-rtd.rst │ │ ├── Dockerfile-torch-neuron.rst │ │ ├── Dockerfile.app-rt-diff │ │ ├── Dockerfile.neuron-rtd │ │ ├── Dockerfile.torch-neuron │ │ ├── dockerd-entrypoint-app-rt-same.rst │ │ └── dockerd-entrypoint.sh │ ├── ec2-then-ec2-devflow.rst │ ├── ec2.rst │ ├── faq-troubleshooting-releasenote.rst │ ├── faq.rst │ ├── files/ │ │ ├── index-dra.rst │ │ ├── manifests/ │ │ │ ├── clusterrole.yaml │ │ │ ├── clusterrolebinding.yaml │ │ │ ├── daemonset.yaml │ │ │ ├── deviceclass.yaml │ │ │ ├── namespace.yaml │ │ │ └── serviceaccount.yaml │ │ ├── scripts/ │ │ │ └── install-dra-driver.sh │ │ └── specs/ │ │ ├── 1x4-connected-devices.yaml │ │ ├── 2-node-inference-us.yaml │ │ ├── 4-node-inference-us.yaml │ │ ├── all-devices.yaml │ │ ├── lnc-setting-trn2.yaml │ │ ├── specific-driver-version.yaml │ │ └── us-and-lnc-config.yaml │ ├── get-started/ │ │ ├── quickstart-configure-deploy-dlc.rst │ │ └── quickstart-pytorch-inference-dlc.rst │ ├── getting-started.rst │ ├── how-to/ │ │ └── how-to-ultraserver.rst │ ├── index.rst │ ├── k8.rst │ ├── kubernetes-getting-started.rst │ ├── locate-neuron-dlc-image.rst │ ├── neo-then-hosting-devflow.rst │ ├── neuron-dra.rst │ ├── neuron-plugins.rst │ ├── neuron_dlc_images.csv │ ├── troubleshooting.rst │ ├── tutorial-docker-runtime1.0.rst │ ├── tutorials/ │ │ ├── build-run-neuron-container.rst │ │ ├── inference/ │ │ │ ├── index.rst │ │ │ ├── index.txt │ │ │ ├── k8s_rn50_demo.rst │ │ │ └── tutorial-infer.rst │ │ ├── k8s-default-scheduler.rst │ │ ├── k8s-multiple-scheduler.rst │ │ ├── k8s-neuron-device-plugin.rst │ │ ├── k8s-neuron-helm-chart.rst │ │ ├── k8s-neuron-monitor.rst │ │ ├── k8s-neuron-problem-detector-and-recovery-irsa.rst │ │ ├── k8s-neuron-problem-detector-and-recovery.rst │ │ ├── k8s-neuron-scheduler-flow.rst │ │ ├── k8s-neuron-scheduler.rst │ │ ├── k8s-prerequisite.rst │ │ ├── k8s-setup.rst │ │ ├── training/ │ │ │ ├── index.rst │ │ │ ├── index.txt │ │ │ ├── k8s_mlp_train_demo.rst │ │ │ └── tutorial-training.rst │ │ ├── tutorial-docker-env-setup.rst │ │ └── tutorial-oci-hook.rst │ └── tutorials.rst ├── devflows/ │ ├── aws-batch-flows.rst │ ├── aws-batch-flows.txt │ ├── dlc-then-customize-devflow.rst │ ├── ec2-flows.rst │ ├── ec2-flows.txt │ ├── ecs-flows.rst │ ├── eks-flows.rst │ ├── index.rst │ ├── inference/ │ │ ├── aws-batch-flows.rst │ │ ├── aws-batch-flows.txt │ │ ├── byoc-hosting-devflow-inf2.rst │ │ ├── byoc-hosting-devflow.rst │ │ ├── 
container-sm-hosting-devflow.rst │ │ ├── dev-flows.rst │ │ ├── dlc-then-ec2-devflow.rst │ │ ├── dlc-then-ecs-devflow.rst │ │ ├── dlc-then-eks-devflow.rst │ │ ├── dlc-then-k8s-devflow.rst │ │ ├── ec2-flows.rst │ │ ├── ec2-flows.txt │ │ ├── ec2-then-ec2-devflow-inf2.rst │ │ ├── ec2-then-ec2-devflow.rst │ │ ├── env-setup-text.rst │ │ ├── neo-then-hosting-devflow.rst │ │ ├── parallelcluster-flows.rst │ │ ├── parallelcluster-flows.txt │ │ ├── sagemaker-flows.rst │ │ └── sagemaker-flows.txt │ ├── parallelcluster-flows.rst │ ├── parallelcluster-flows.txt │ ├── plugins/ │ │ ├── npd-ecs-flows.rst │ │ └── npd-ecs-flows.txt │ ├── sagemaker-flows.rst │ ├── setup/ │ │ ├── ecs-flows.rst │ │ ├── ecs-flows.txt │ │ ├── eks-flows.rst │ │ └── eks-flows.txt │ ├── third-party-solutions.rst │ └── training/ │ ├── aws-batch-flows.rst │ ├── aws-batch-flows.txt │ ├── batch/ │ │ └── batch-training.rst │ ├── dlc-then-ecs-devflow.rst │ ├── ec2/ │ │ └── ec2-training.rst │ ├── ec2-flows.rst │ ├── ec2-flows.txt │ ├── parallelcluster/ │ │ └── parallelcluster-training.rst │ ├── parallelcluster-flows.rst │ ├── parallelcluster-flows.txt │ ├── sagemaker-flows.rst │ ├── sagemaker-flows.txt │ └── sm-devflow/ │ └── sm-training-devflow.rst ├── dlami/ │ └── index.rst ├── frameworks/ │ ├── index.rst │ ├── jax/ │ │ ├── api-reference-guide/ │ │ │ ├── index.rst │ │ │ └── neuron-envvars.rst │ │ ├── index.rst │ │ └── setup/ │ │ ├── jax-neuronx-known-issues.rst │ │ └── jax-setup.rst │ └── torch/ │ ├── about/ │ │ └── index.rst │ ├── guide-torch-neuron-vs-torch-neuronx-inference.rst │ ├── index.rst │ ├── inference-torch-neuronx.rst │ ├── pytorch-native-overview.rst │ ├── torch-neuronx/ │ │ ├── additional-examples-inference-torch-neuronx.rst │ │ ├── additional-examples-training.rst │ │ ├── api-reference-guide/ │ │ │ ├── inference/ │ │ │ │ ├── api-torch-neuronx-analyze.rst │ │ │ │ ├── api-torch-neuronx-async-lazy-load.rst │ │ │ │ ├── api-torch-neuronx-core-placement.rst │ │ │ │ ├── api-torch-neuronx-data-parallel.rst │ │ │ │ ├── api-torch-neuronx-replace-weights.rst │ │ │ │ ├── api-torch-neuronx-trace.rst │ │ │ │ └── inference-api-guide-torch-neuronx.rst │ │ │ ├── torch-neuronx-profiling-api.rst │ │ │ └── training/ │ │ │ ├── index.rst │ │ │ ├── pytorch-neuron-parallel-compile.rst │ │ │ └── torch-neuron-envvars.rst │ │ ├── misc-inference-torch-neuronx.rst │ │ ├── misc-training.rst │ │ ├── programming-guide/ │ │ │ ├── inference/ │ │ │ │ ├── autobucketing-dev-guide.rst │ │ │ │ ├── core-placement.rst │ │ │ │ ├── index.rst │ │ │ │ └── trace-vs-xla-lazytensor.rst │ │ │ ├── torch-neuronx-profiling-dev-guide.rst │ │ │ └── training/ │ │ │ ├── index.rst │ │ │ ├── pytorch-neuron-debug.rst │ │ │ └── pytorch-neuron-programming-guide.rst │ │ ├── pytorch-neuron-supported-operators.rst │ │ ├── setup/ │ │ │ ├── install-templates/ │ │ │ │ └── pytorch-dev-install.txt │ │ │ ├── note-setup-general.rst │ │ │ ├── prev-releases/ │ │ │ │ ├── neuronx-2.7.0-pytorch-install.rst │ │ │ │ ├── neuronx-2.8.0-pytorch-install.rst │ │ │ │ └── neuronx-2.9.0-pytorch-install.rst │ │ │ ├── pytorch-install-prev-al2.rst │ │ │ ├── pytorch-install-prev-al2023.rst │ │ │ ├── pytorch-install-prev-u20.rst │ │ │ ├── pytorch-install-prev-u22.rst │ │ │ ├── pytorch-install-prev-u24.rst │ │ │ ├── pytorch-install.rst │ │ │ ├── pytorch-neuronx-install-cxx11.rst │ │ │ ├── pytorch-update-al2-dlami.rst │ │ │ ├── pytorch-update-al2.rst │ │ │ ├── pytorch-update-al2023.rst │ │ │ ├── pytorch-update-u20-dlami.rst │ │ │ ├── pytorch-update-u20.rst │ │ │ ├── pytorch-update-u22.rst │ │ │ └── 
pytorch-update-u24.rst │ │ ├── setup-trn1-multi-node-execution.rst │ │ ├── torch-neuronx-dataparallel-example-default.rst │ │ ├── torch-neuronx-dataparallel-example-dim-neq-zero.rst │ │ ├── torch-neuronx-dataparallel-example-disable-dynamic-batching.rst │ │ ├── torch-neuronx-dataparallel-example-dynamic-batching.rst │ │ ├── torch-neuronx-dataparallel-example-specify-ncs.rst │ │ ├── training-troubleshooting.rst │ │ └── tutorials/ │ │ ├── inference/ │ │ │ ├── tutorial-torchserve-neuronx.rst │ │ │ └── tutorials-torch-neuronx.rst │ │ ├── note-performance.txt │ │ └── training/ │ │ ├── analyze_for_training.rst │ │ ├── bert.rst │ │ ├── finetune_hftrainer.rst │ │ ├── mlp.rst │ │ ├── tutorial_source_code/ │ │ │ ├── analyze_training/ │ │ │ │ └── analyze_training_code.sh │ │ │ ├── bert_mrpc_finetuning/ │ │ │ │ ├── bert_mrpc_finetuning_converted_checkpoint_training.sh │ │ │ │ ├── bert_mrpc_finetuning_multi_worker_training_code.sh │ │ │ │ ├── bert_mrpc_finetuning_setup_code.sh │ │ │ │ └── bert_mrpc_finetuning_single_worker_training.sh │ │ │ ├── bert_training/ │ │ │ │ ├── bert_amp_training_code.sh │ │ │ │ ├── bert_lamb_bf16_training_code.sh │ │ │ │ ├── bert_lamb_training_code.sh │ │ │ │ ├── bert_phase2_training_code.sh │ │ │ │ ├── bert_precompilation_code.sh │ │ │ │ ├── bert_setup_code.sh │ │ │ │ ├── bert_setup_code_ph2.sh │ │ │ │ └── bert_training_code.sh │ │ │ ├── multi_layer_perceptron_training/ │ │ │ │ └── multi_layer_perceptron_training_code.sh │ │ │ └── zero1_training/ │ │ │ └── zero1_single_node_training_code.sh │ │ ├── tutorials-training-torch-neuronx.rst │ │ └── zero1_gpt2.rst │ ├── torch-setup.rst │ └── training-torch-neuronx.rst ├── general/ │ └── faq.rst ├── includes/ │ └── setup/ │ ├── select-framework-note.txt │ ├── tab-inference-mxnet-neuron-al2.txt │ ├── tab-inference-mxnet-neuron-al2023.txt │ ├── tab-inference-mxnet-neuron-u20.txt │ ├── tab-inference-mxnet-neuron-u22.txt │ ├── tab-inference-mxnet-neuron.txt │ ├── tab-inference-tensorflow-neuron-al2.txt │ ├── tab-inference-tensorflow-neuron-al2023.txt │ ├── tab-inference-tensorflow-neuron-u20.txt │ ├── tab-inference-tensorflow-neuron-u22.txt │ ├── tab-inference-tensorflow-neuronx-al2.txt │ ├── tab-inference-tensorflow-neuronx-al2023.txt │ ├── tab-inference-tensorflow-neuronx-u20.txt │ ├── tab-inference-tensorflow-neuronx-u22.txt │ ├── tab-inference-torch-neuron-al2.txt │ ├── tab-inference-torch-neuron-al2023.txt │ ├── tab-inference-torch-neuron-u20.txt │ ├── tab-inference-torch-neuron-u22.txt │ ├── tab-inference-torch-neuron.txt │ ├── tab-inference-torch-neuronx-al2.txt │ ├── tab-inference-torch-neuronx-al2023.txt │ ├── tab-inference-torch-neuronx-u20.txt │ ├── tab-inference-torch-neuronx-u22.txt │ └── tab-inference-torch-neuronx-u24.txt ├── index.rst ├── info/ │ └── exclude ├── libraries/ │ ├── index.rst │ ├── nemo-megatron/ │ │ └── index.rst │ ├── neuronx-distributed/ │ │ ├── activation_memory_reduction.rst │ │ ├── activation_memory_reduction_developer_guide.rst │ │ ├── api-reference-guide-inference.rst │ │ ├── api-reference-guide-training.rst │ │ ├── api-reference-guide.rst │ │ ├── api-reference-guide.txt │ │ ├── api_guide.rst │ │ ├── app_notes.rst │ │ ├── app_notes.txt │ │ ├── context_parallelism_overview.rst │ │ ├── developer-guide-inference.rst │ │ ├── developer-guide-inference.txt │ │ ├── developer-guide-training.rst │ │ ├── developer-guide-training.txt │ │ ├── developer-guide.rst │ │ ├── developer-guide.txt │ │ ├── index-inference.rst │ │ ├── index-training.rst │ │ ├── lora_finetune_developer_guide.rst │ │ ├── 
model_builder_v2_api_reference.rst │ │ ├── model_optimizer_wrapper_developer_guide.rst │ │ ├── neuronx-distributed-misc.rst │ │ ├── neuronx-distributed-misc.txt │ │ ├── neuronx_distributed_inference_developer_guide.rst │ │ ├── pipeline_parallelism_overview.rst │ │ ├── pp_developer_guide.rst │ │ ├── ptl_developer_guide.rst │ │ ├── save_load_developer_guide.rst │ │ ├── setup/ │ │ │ ├── index.rst │ │ │ └── index.txt │ │ ├── standard_mixed_precision.rst │ │ ├── tensor_parallelism_overview.rst │ │ ├── tp_developer_guide.rst │ │ └── tutorials/ │ │ ├── finetune_llama3_8b_ptl_lora.rst │ │ ├── index.rst │ │ ├── index.txt │ │ ├── inference.rst │ │ ├── inference_tutorials.rst │ │ ├── neuronx_distributed_tutorials.txt │ │ ├── nxd-source-code/ │ │ │ ├── llama_tp_pp/ │ │ │ │ ├── llama_2_13b.sh │ │ │ │ ├── llama_2_70b.sh │ │ │ │ ├── llama_31_70b.sh │ │ │ │ ├── llama_3_70b.sh │ │ │ │ └── llama_tp_pp_setup.sh │ │ │ └── llama_tp_zero1/ │ │ │ ├── llama_2_7b.sh │ │ │ ├── llama_31_8b.sh │ │ │ ├── llama_3_8b.sh │ │ │ └── llama_tp_zero1_setup.sh │ │ ├── nxd_inference_tutorials.txt │ │ ├── nxd_training_tutorials.txt │ │ ├── training.rst │ │ ├── training_llama_tp_pp.rst │ │ ├── training_llama_tp_zero1.rst │ │ └── training_tutorials.rst │ ├── nxd-inference/ │ │ ├── _templates/ │ │ │ ├── model_card.jinja.rst │ │ │ └── model_card_qwen3.jinja.rst │ │ ├── api-guides/ │ │ │ ├── api-guide.rst │ │ │ ├── api-guide.txt │ │ │ └── index.rst │ │ ├── app-notes/ │ │ │ ├── app_notes.txt │ │ │ ├── index.rst │ │ │ └── parallelism.rst │ │ ├── developer_guides/ │ │ │ ├── accuracy-eval-with-datasets.rst │ │ │ ├── custom-quantization.rst │ │ │ ├── disaggregated-inference.rst │ │ │ ├── feature-guide.rst │ │ │ ├── how-to-use-fpem.rst │ │ │ ├── index.rst │ │ │ ├── llm-inference-benchmarking-guide.rst │ │ │ ├── migrate-from-tnx-to-nxdi.rst │ │ │ ├── model-reference.rst │ │ │ ├── moe-arch-deep-dive.rst │ │ │ ├── nxd-examples-migration-guide.rst │ │ │ ├── onboarding-models.rst │ │ │ ├── performance-cli-params.rst │ │ │ ├── vllm-user-guide-v1.rst │ │ │ ├── vllm-user-guide.rst │ │ │ ├── weights-sharding-guide.rst │ │ │ └── writing-tests.rst │ │ ├── examples/ │ │ │ └── vllm_client.py │ │ ├── index.rst │ │ ├── misc/ │ │ │ ├── index.rst │ │ │ ├── misc.txt │ │ │ └── nxdi-troubleshooting.rst │ │ ├── models/ │ │ │ ├── index.rst │ │ │ ├── llama3/ │ │ │ │ ├── data/ │ │ │ │ │ └── card_llama33_70b.yml │ │ │ │ └── llama_33_70b.rst │ │ │ ├── models.txt │ │ │ └── qwen3/ │ │ │ ├── data/ │ │ │ │ └── card_qwen3_moe_235b.yml │ │ │ └── qwen3_moe_235b.rst │ │ ├── neuron-inference-overview.rst │ │ ├── nxdi-setup.rst │ │ ├── overview-index.rst │ │ ├── setup.txt │ │ ├── tutorials/ │ │ │ ├── disaggregated-inference-tutorial-1p1d.rst │ │ │ ├── disaggregated-inference-tutorial.rst │ │ │ ├── flux-inference-tutorial.ipynb │ │ │ ├── flux-inpainting-inference-tutorial.ipynb │ │ │ ├── generating-results-with-performance-cli.ipynb │ │ │ ├── index.rst │ │ │ ├── llama4-tutorial-v0.ipynb │ │ │ ├── llama4-tutorial.ipynb │ │ │ ├── llama405b_perf_comparison.csv │ │ │ ├── llama70b_apc_perf_comparison.csv │ │ │ ├── llama70b_perf_comparison.csv │ │ │ ├── modules_to_not_convert.json │ │ │ ├── pixtral-tutorial.ipynb │ │ │ ├── qwen2-vl-tutorial.ipynb │ │ │ ├── qwen3-moe-tutorial.ipynb │ │ │ ├── qwen3-vl-tutorial.ipynb │ │ │ ├── sd-inference-tutorial.rst │ │ │ ├── trn1-llama3.1-70b-instruct-accuracy-eval-tutorial.ipynb │ │ │ ├── trn2-llama3.1-405b-speculative-tutorial.rst │ │ │ ├── trn2-llama3.1-405b-tutorial.rst │ │ │ ├── trn2-llama3.1-8b-multi-lora-tutorial.ipynb │ │ │ ├── 
trn2-llama3.3-70b-apc-tutorial.ipynb │ │ │ ├── trn2-llama3.3-70b-dp-tutorial.ipynb │ │ │ ├── trn2-llama3.3-70b-fp8.rst │ │ │ ├── trn2-llama3.3-70b-tutorial.rst │ │ │ └── trn3-gpt-oss-120b-tutorial.rst │ │ └── vllm/ │ │ ├── index.rst │ │ ├── quickstart-vllm-offline-serving.rst │ │ └── quickstart-vllm-online-serving.rst │ ├── nxd-training/ │ │ ├── api-guide.txt │ │ ├── api-reference-guide.rst │ │ ├── app_notes/ │ │ │ ├── nxd-training-amr-appnote.rst │ │ │ ├── nxd-training-cp-appnote.rst │ │ │ ├── nxd-training-pp-appnote.rst │ │ │ └── nxd-training-tp-appnote.rst │ │ ├── app_notes.rst │ │ ├── app_notes.txt │ │ ├── developer-guide.rst │ │ ├── developer_guides/ │ │ │ ├── cpu_mode_developer_guide.rst │ │ │ ├── dev-guide.txt │ │ │ ├── index.rst │ │ │ ├── migration_nemo_nxdt.rst │ │ │ ├── migration_nnm_nxdt.rst │ │ │ ├── nemo_nxdt_mapping.csv │ │ │ ├── new_dataloader_guide.rst │ │ │ ├── new_model_guide.rst │ │ │ ├── nnm_nxdt_mapping.csv │ │ │ └── optimizer_lr_scheduler_flow.rst │ │ ├── general/ │ │ │ ├── config_overview.rst │ │ │ ├── features.rst │ │ │ ├── installation_guide.rst │ │ │ ├── known-issues.txt │ │ │ └── known_issues.rst │ │ ├── index.rst │ │ ├── misc.rst │ │ ├── misc.txt │ │ ├── overview.rst │ │ ├── overview.txt │ │ ├── setup.txt │ │ └── tutorials/ │ │ ├── checkpoint_conversion.rst │ │ ├── hf_llama3_70B_pretraining.rst │ │ ├── hf_llama3_8B_DPO_ORPO.rst │ │ ├── hf_llama3_8B_SFT.rst │ │ ├── hf_llama3_8B_SFT_LORA.rst │ │ ├── hf_llama3_8B_pretraining.rst │ │ ├── index.rst │ │ ├── megatron_gpt_pretraining.rst │ │ └── tutorials.txt │ └── transformers-neuronx/ │ └── index.rst ├── llms.txt ├── neuron-customops/ │ ├── api-reference-guide/ │ │ ├── api-reference-guide.rst │ │ └── custom-ops-ref-guide.rst │ ├── customops-intro.txt │ ├── index.rst │ ├── misc-customops.rst │ ├── programming-guide/ │ │ ├── custom-c++-operators-devguide.rst │ │ └── programming-guide.rst │ └── tutorials/ │ ├── customop-mlp-perf-opt.rst │ ├── customop-mlp-training.rst │ ├── tutorial_source_code/ │ │ ├── custom_c_mlp_training/ │ │ │ └── custom_c_mlp_training_code.sh │ │ └── custom_c_perf_optimization/ │ │ └── custom_c_perf_optimization_code.sh │ └── tutorials.rst ├── neuron-runtime/ │ ├── about/ │ │ ├── collectives.rst │ │ ├── core-dump.rst │ │ └── index.rst │ ├── api/ │ │ ├── debug-stream-api.rst │ │ ├── index.rst │ │ ├── ndebug_stream.rst │ │ ├── ndl.rst │ │ ├── nec.rst │ │ ├── neuron_driver_shared.rst │ │ ├── neuron_driver_shared_tensor_batch_op.rst │ │ ├── neuron_ds.rst │ │ ├── nrt-async-api-best-practices.rst │ │ ├── nrt-async-api-examples.rst │ │ ├── nrt-async-api-overview.rst │ │ ├── nrt.rst │ │ ├── nrt_async.rst │ │ ├── nrt_async_sendrecv.rst │ │ ├── nrt_experimental.rst │ │ ├── nrt_profile.rst │ │ ├── nrt_status.rst │ │ ├── nrt_sys_trace.rst │ │ └── nrt_version.rst │ ├── configuration-guide.rst │ ├── explore/ │ │ ├── compute-comm-overlap.rst │ │ ├── core-dump-deep-dive.rst │ │ ├── device-memory.rst │ │ ├── direct-hbm-tensor-alloc.rst │ │ ├── index.rst │ │ ├── internode-collective-comm.rst │ │ ├── intranode-collective-comm.rst │ │ ├── runtime-performance-tips.rst │ │ └── work-with-neff-files.rst │ ├── faq.rst │ ├── index.rst │ ├── nrt-configurable-parameters.rst │ ├── nrt-developer-guide.rst │ ├── nrt-troubleshoot.rst │ └── rn.rst ├── nki/ │ ├── _ext/ │ │ └── nki_directives.py │ ├── _templates/ │ │ ├── nki-custom-class-attr-only-template.rst │ │ └── nki-custom-class-template.rst │ ├── api/ │ │ ├── index.rst │ │ ├── nki/ │ │ │ ├── __init__.py │ │ │ ├── collectives/ │ │ │ │ └── __init__.py │ │ │ ├── isa/ │ │ │ │ └── 
__init__.py │ │ │ └── language/ │ │ │ └── __init__.py │ │ ├── nki.api.shared.rst │ │ ├── nki.collectives.rst │ │ ├── nki.isa.rst │ │ ├── nki.isa.rst.bak │ │ ├── nki.language.rst │ │ ├── nki.language.tile_size.rst │ │ ├── nki.rst │ │ └── nki.simulate.rst │ ├── deep-dives/ │ │ ├── index.rst │ │ ├── mxfp-matmul.rst │ │ ├── nki-aps.rst │ │ ├── nki-compiler.rst │ │ ├── nki-dge.rst │ │ ├── nki-dma-bandwidth-guide.rst │ │ ├── nki-dynamic-loops.rst │ │ ├── nki_perf_guide.rst │ │ └── src/ │ │ └── mxfp-matmul/ │ │ ├── mx_cpu_utils.py │ │ ├── mx_kernel_utils.py │ │ ├── mx_kernels.py │ │ └── mx_toplevel.py │ ├── examples/ │ │ ├── average_pool2d/ │ │ │ ├── average_pool2d_jax.py │ │ │ ├── average_pool2d_nki_kernels.py │ │ │ └── average_pool2d_torch.py │ │ ├── fused_mamba/ │ │ │ ├── mamba_nki_kernels.py │ │ │ └── mamba_torch.py │ │ ├── getting_started_baremetal.py │ │ ├── getting_started_jax.py │ │ ├── getting_started_torch.py │ │ ├── index-case-1.py │ │ ├── index-case-3.py │ │ ├── layout-dynamic-loop.py │ │ ├── layout-loop.py │ │ ├── layout-pass.py │ │ ├── layout-violation.py │ │ ├── matrix_multiplication/ │ │ │ ├── matrix_multiplication_nki_kernels.py │ │ │ └── matrix_multiplication_torch.py │ │ ├── simulate/ │ │ │ └── nki_simulate_example.py │ │ ├── tensor_addition/ │ │ │ └── tensor_addition_nki_kernels.py │ │ └── transpose2d/ │ │ ├── transpose2d_jax.py │ │ ├── transpose2d_nki_kernels.py │ │ └── transpose2d_torch.py │ ├── get-started/ │ │ ├── about/ │ │ │ ├── data-representation-overview.rst │ │ │ ├── index.rst │ │ │ ├── indexing-overview.rst │ │ │ ├── lnc.rst │ │ │ ├── memory-hierarchy-overview.rst │ │ │ ├── nki-dma-overview.rst │ │ │ └── tiling-overview.rst │ │ ├── index.rst │ │ ├── nki-language-guide.rst │ │ ├── quickstart-implement-run-kernel.rst │ │ └── setup-env.rst │ ├── guides/ │ │ ├── architecture/ │ │ │ ├── index.rst │ │ │ ├── trainium2_arch.rst │ │ │ ├── trainium3_arch.rst │ │ │ └── trainium_inferentia2_arch.rst │ │ ├── framework_custom_op.rst │ │ ├── how-to-scheduling-apis.rst │ │ ├── index.rst │ │ ├── nki_simulator.rst │ │ ├── tutorials/ │ │ │ ├── average_pool2d.rst │ │ │ ├── fused_mamba.rst │ │ │ ├── index.rst │ │ │ ├── kernel-optimization.rst │ │ │ ├── matrix_multiplication.rst │ │ │ └── transpose2d.rst │ │ └── use-neuron-profile.rst │ ├── index.rst │ ├── library/ │ │ ├── about/ │ │ │ └── index.rst │ │ ├── api/ │ │ │ ├── attention-block-tkg.rst │ │ │ ├── attention-cte.rst │ │ │ ├── attention-tkg.rst │ │ │ ├── blockwise-mm-backward.rst │ │ │ ├── conv1d.rst │ │ │ ├── cross-entropy.rst │ │ │ ├── cumsum.rst │ │ │ ├── depthwise-conv1d.rst │ │ │ ├── dynamic-elementwise-add.rst │ │ │ ├── fg-allgather.rst │ │ │ ├── fgcc.rst │ │ │ ├── find-nonzero-indices.rst │ │ │ ├── index.rst │ │ │ ├── mlp.rst │ │ │ ├── moe-cte.rst │ │ │ ├── moe-tkg.rst │ │ │ ├── output-projection-cte.rst │ │ │ ├── output-projection-tkg.rst │ │ │ ├── qkv.rst │ │ │ ├── rmsnorm-quant.rst │ │ │ ├── rope.rst │ │ │ ├── router-topk.rst │ │ │ ├── sb2sb-allgather.rst │ │ │ ├── topk-reduce.rst │ │ │ └── transformer-tkg.rst │ │ ├── index.rst │ │ ├── kernel-utils/ │ │ │ ├── allocator.rst │ │ │ ├── index.rst │ │ │ └── tensor-view.rst │ │ └── specs/ │ │ ├── design-rmsnorm-quant.rst │ │ └── index.rst │ ├── migration/ │ │ ├── index.rst │ │ ├── nki-0-3-0-update-guide.rst │ │ ├── nki-beta2-migration-guide.rst │ │ └── nki_block_dimension_migration_guide.rst │ ├── nki_faq.rst │ ├── scripts/ │ │ ├── markdown2rst.py │ │ └── requirements.txt │ └── test/ │ ├── test_nki_isa_activation.py │ ├── test_nki_isa_affine_select.py │ ├── 
test_nki_isa_bn_stats.py │ ├── test_nki_isa_copypredicated.py │ ├── test_nki_isa_dma_copy.py │ ├── test_nki_isa_dma_transpose.py │ ├── test_nki_isa_dropout.py │ ├── test_nki_isa_iota.py │ ├── test_nki_isa_local_gather.py │ ├── test_nki_isa_max8.py │ ├── test_nki_isa_memset.py │ ├── test_nki_isa_nc_find_index8.py │ ├── test_nki_isa_nc_match_replace8.py │ ├── test_nki_isa_nc_matmul.py │ ├── test_nki_isa_nc_stream_shuffle.py │ ├── test_nki_isa_nc_transpose.py │ ├── test_nki_isa_partition_reduce.py │ ├── test_nki_isa_range_select.py │ ├── test_nki_isa_reciprocal.py │ ├── test_nki_isa_reduce.py │ ├── test_nki_isa_select_reduce.py │ ├── test_nki_isa_sequence_bounds.py │ ├── test_nki_isa_tensor_copy.py │ ├── test_nki_isa_tensor_scalar.py │ ├── test_nki_isa_tensor_scalar_cumulative.py │ ├── test_nki_isa_tensor_tensor.py │ ├── test_nki_isa_tensor_tensor_scan.py │ ├── test_nki_mask.py │ ├── test_nki_memory_semantics.py │ ├── test_nki_nl_add.py │ ├── test_nki_nl_atomic_rmw.py │ ├── test_nki_nl_broadcast.py │ ├── test_nki_nl_dslice.py │ ├── test_nki_nl_gather_flattened.py │ ├── test_nki_nl_load_store.py │ ├── test_nki_nl_load_store_indirect.py │ ├── test_nki_nl_load_transpose2d.py │ ├── test_nki_nl_mgrid.py │ ├── test_nki_simulate_kernel.py │ ├── test_nki_spmd_grid.py │ ├── test_psum_modulo_alloc.py │ └── test_sbuf_modulo_alloc.py ├── release-notes/ │ ├── 2.29.0.rst │ ├── archive/ │ │ ├── customcxxps/ │ │ │ ├── gpsimd-customop-lib.rst │ │ │ └── gpsimd-tools.rst │ │ ├── index.rst │ │ ├── libneuronxla.rst │ │ ├── mxnet-neuron.rst │ │ ├── nemo/ │ │ │ ├── index.rst │ │ │ └── neuronx-nemo.rst │ │ ├── neuron-cc/ │ │ │ ├── neuron-cc-ops/ │ │ │ │ ├── index.rst │ │ │ │ ├── neuron-cc-ops-mxnet.rst │ │ │ │ ├── neuron-cc-ops-pytorch.rst │ │ │ │ ├── neuron-cc-ops-tensorflow.rst │ │ │ │ └── neuron-cc-ops-xla.rst │ │ │ └── neuron-cc.rst │ │ ├── neuron1/ │ │ │ ├── _legacy-labels.rst │ │ │ ├── neuronrelease/ │ │ │ │ └── previous-content.rst │ │ │ └── prev/ │ │ │ ├── content.rst │ │ │ └── rn.rst │ │ ├── tensorboard-neuron.rst │ │ ├── tensorflow/ │ │ │ ├── tensorflow-modelserver-neuron/ │ │ │ │ ├── tensorflow-modelserver-neuron-v2.rst │ │ │ │ ├── tensorflow-modelserver-neuron.rst │ │ │ │ └── tensorflow-modelserver-neuronx.rst │ │ │ ├── tensorflow-neuron/ │ │ │ │ ├── tensorflow-neuron-v2.rst │ │ │ │ └── tensorflow-neuron.rst │ │ │ └── tensorflow-neuronx/ │ │ │ └── tensorflow-neuronx.rst │ │ └── torch-neuron.rst │ ├── components/ │ │ ├── compiler.rst │ │ ├── containers.rst │ │ ├── dev-tools.rst │ │ ├── dlamis.rst │ │ ├── index.rst │ │ ├── jax.rst │ │ ├── nki-lib.rst │ │ ├── nki.rst │ │ ├── nxd-core.rst │ │ ├── nxd-inference.rst │ │ ├── nxd-training.rst │ │ ├── pytorch.rst │ │ └── runtime.rst │ ├── documentation/ │ │ └── neuron-documentation.rst │ ├── index.rst │ ├── prev/ │ │ ├── 2.25.0/ │ │ │ ├── compiler.rst │ │ │ ├── containers.rst │ │ │ ├── dlami.rst │ │ │ ├── docs-and-samples.rst │ │ │ ├── index.rst │ │ │ ├── nx-jax.rst │ │ │ ├── nx-pytorch.rst │ │ │ ├── nxd-core.rst │ │ │ ├── nxd-inference.rst │ │ │ ├── nxd-training.rst │ │ │ ├── runtime.rst │ │ │ └── tools.rst │ │ ├── 2.26.0/ │ │ │ ├── containers.rst │ │ │ ├── dlami.rst │ │ │ ├── index.rst │ │ │ ├── nki.rst │ │ │ ├── nx-jax.rst │ │ │ ├── nx-pytorch.rst │ │ │ ├── nxd-core.rst │ │ │ ├── nxd-inference.rst │ │ │ ├── runtime.rst │ │ │ └── tools.rst │ │ ├── 2.26.1.rst │ │ ├── 2.27.0/ │ │ │ ├── compiler.rst │ │ │ ├── containers.rst │ │ │ ├── dlami.rst │ │ │ ├── index.rst │ │ │ ├── nki-lib.rst │ │ │ ├── nki.rst │ │ │ ├── nx-pytorch.rst │ │ │ ├── nxd-inference.rst │ │ │ ├── 
runtime.rst │ │ │ └── tools.rst │ │ ├── 2.27.1.rst │ │ ├── 2.28.0.rst │ │ ├── 2.28.1.rst │ │ ├── content.rst │ │ └── rn.rst │ └── releasecontent.rst ├── requirements-python310.txt ├── requirements-python38.txt ├── requirements.txt ├── setup/ │ ├── index.rst │ ├── index.txt-back │ ├── install-templates/ │ │ ├── al2-python.rst │ │ ├── inf1/ │ │ │ ├── compile_mode.rst │ │ │ ├── deploy_mode.rst │ │ │ ├── develop_mode.rst │ │ │ ├── dlami-enable-neuron-mxnet.rst │ │ │ ├── dlami-enable-neuron-pytorch.rst │ │ │ ├── launch-inf1-ami.rst │ │ │ ├── launch-inf1-dlami-aws-cli.rst │ │ │ ├── launch-inf1-dlami.rst │ │ │ ├── neuron-pip-install.rst │ │ │ ├── neuron-pip-setup.rst │ │ │ ├── note-setup-cntr.rst │ │ │ ├── note-setup-general.rst │ │ │ ├── note-setup-libnrt-warning.rst │ │ │ └── tensorboard-plugin-neuron-pip-install.rst │ │ ├── inf2/ │ │ │ ├── dlami-enable-neuron-pytorch.rst │ │ │ ├── launch-inf2-dlami.rst │ │ │ └── note-setup-libnrt-warning.rst │ │ ├── launch-instance.txt │ │ ├── launch-trn1-dlami.rst │ │ ├── trn1/ │ │ │ └── dlami-notes.rst │ │ └── trn1-ga-warning.txt │ ├── jax/ │ │ ├── dlami.rst │ │ ├── dlc.rst │ │ ├── index.rst │ │ └── manual.rst │ ├── jax-neuronx.rst │ ├── legacy-inf1/ │ │ ├── index.rst │ │ └── pytorch.rst │ ├── multiframework-dlami.rst │ ├── mxnet-neuron.rst │ ├── notebook/ │ │ ├── running-jupyter-notebook-as-script.rst │ │ └── setup-jupyter-notebook-steps-troubleshooting.rst │ ├── pytorch/ │ │ ├── dlami.rst │ │ ├── dlc.rst │ │ ├── index.rst │ │ ├── manual.rst │ │ ├── update-dlami.rst │ │ ├── update-dlc.rst │ │ └── update-manual.rst │ ├── setup-rocky-linux-9.rst │ ├── setup-troubleshooting.rst │ ├── torch-neuron-ubuntu20.rst │ ├── torch-neuron.rst │ ├── torch-neuronx.rst │ └── troubleshooting.rst ├── src/ │ ├── benchmark/ │ │ ├── helper_scripts/ │ │ │ ├── llmperf_dp.patch │ │ │ ├── llmperf_reasoning.patch │ │ │ └── neuron_perf.patch │ │ └── tensorflow/ │ │ ├── distilbert-base-uncased-finetuned-sst-2-english_benchmark.py │ │ └── distilbert-base-uncased-finetuned-sst-2-english_compile.py │ ├── examples/ │ │ ├── mxnet/ │ │ │ ├── README.md │ │ │ ├── data_parallel/ │ │ │ │ ├── benchmark_utils.py │ │ │ │ ├── data_parallel_tutorial.ipynb │ │ │ │ └── parallel.py │ │ │ ├── mxnet-gluon-tutorial.ipynb │ │ │ ├── resnet50/ │ │ │ │ └── resnet50.ipynb │ │ │ └── resnet50_neuroncore_groups.ipynb │ │ ├── neuron-monitor/ │ │ │ └── neuron-monitor-grafana.json │ │ ├── pytorch/ │ │ │ ├── bert_tutorial/ │ │ │ │ ├── README.md │ │ │ │ ├── THIRD │ │ │ │ ├── THIRD PARTY LICENSE.txt │ │ │ │ ├── bert_benchmark_utils.py │ │ │ │ ├── glue_mrpc_dev.tsv │ │ │ │ ├── parallel.py │ │ │ │ ├── tutorial_pretrained_bert.ipynb │ │ │ │ └── tutorial_pretrained_bert_shared_weights.ipynb │ │ │ ├── byoc_sm_bert_tutorial/ │ │ │ │ ├── code/ │ │ │ │ │ └── inference.py │ │ │ │ ├── container/ │ │ │ │ │ └── Dockerfile │ │ │ │ └── sagemaker_container_neuron.ipynb │ │ │ ├── libtorch_demo/ │ │ │ │ ├── bert_neuronx/ │ │ │ │ │ ├── compile.py │ │ │ │ │ └── detect_instance.py │ │ │ │ ├── clean.sh │ │ │ │ ├── example_app/ │ │ │ │ │ ├── README.txt │ │ │ │ │ ├── build.sh │ │ │ │ │ ├── core_count.hpp │ │ │ │ │ ├── example_app.cpp │ │ │ │ │ ├── utils.cpp │ │ │ │ │ └── utils.hpp │ │ │ │ ├── neuron.patch │ │ │ │ ├── run_tests.sh │ │ │ │ ├── setup.sh │ │ │ │ ├── tokenizers_binding/ │ │ │ │ │ ├── build.sh │ │ │ │ │ ├── remote_rust_tokenizer.h │ │ │ │ │ ├── run.sh │ │ │ │ │ ├── run_python.sh │ │ │ │ │ ├── tokenizer_test │ │ │ │ │ ├── tokenizer_test.cpp │ │ │ │ │ └── tokenizer_test.py │ │ │ │ └── trace_bert_neuron.py │ │ │ ├── mnist_mlp/ │ │ │ │ 
├── train_monitor.py │ │ │ │ └── train_tb.py │ │ │ ├── neuronx_distributed/ │ │ │ │ └── t5-inference/ │ │ │ │ ├── t5-inference-tutorial.ipynb │ │ │ │ ├── t5_model_layers.py │ │ │ │ ├── t5_models.py │ │ │ │ └── wrapper.py │ │ │ ├── pipeline_tutorial/ │ │ │ │ └── neuroncore_pipeline_pytorch.ipynb │ │ │ ├── resnet50.ipynb │ │ │ ├── resnet50_partition.ipynb │ │ │ ├── torch-neuronx/ │ │ │ │ ├── bert-base-cased-finetuned-mrpc-inference-on-trn1-tutorial.ipynb │ │ │ │ ├── resnet50-inference-on-trn1-tutorial.ipynb │ │ │ │ └── t5-inference-tutorial.ipynb │ │ │ ├── torchserve/ │ │ │ │ ├── benchmark_bert.py │ │ │ │ ├── config.json │ │ │ │ ├── handler_bert.py │ │ │ │ ├── handler_bert_neuronx.py │ │ │ │ ├── infer_bert.py │ │ │ │ ├── torchserve.config │ │ │ │ ├── trace_bert_neuron.py │ │ │ │ └── trace_bert_neuronx.py │ │ │ ├── transformers-marianmt.ipynb │ │ │ └── yolo_v4.ipynb │ │ └── tensorflow/ │ │ ├── bert_demo/ │ │ │ ├── LICENSE │ │ │ ├── README.md │ │ │ ├── bert_client.py │ │ │ ├── bert_model.py │ │ │ ├── bert_model_server.py │ │ │ ├── bert_no_model.py │ │ │ ├── bert_server.py │ │ │ ├── download_mrpc_data.py │ │ │ ├── glue_mrpc_dev.tsv │ │ │ ├── latency_printer.py │ │ │ ├── mrpc.proto │ │ │ ├── mrpc_feature.py │ │ │ ├── mrpc_pb2.py │ │ │ ├── mrpc_pb2_grpc.py │ │ │ ├── protoc.sh │ │ │ ├── setup.py │ │ │ ├── tokenization.py │ │ │ ├── tune_save.sh │ │ │ └── uncased_L-24_H-1024_A-16.vocab.txt │ │ ├── huggingface_bert/ │ │ │ └── huggingface_bert.ipynb │ │ ├── k8s_bert_demo/ │ │ │ ├── Dockerfile.tfserving_example │ │ │ ├── README.md │ │ │ ├── bert_client.py │ │ │ └── bert_service.yml │ │ ├── keras_resnet50/ │ │ │ ├── LICENSE │ │ │ ├── README.md │ │ │ ├── fp32tofp16.py │ │ │ ├── full_sweep │ │ │ ├── gen_resnet50_keras.py │ │ │ ├── infer_resnet50_keras.py │ │ │ ├── infer_resnet50_keras_loadtest.py │ │ │ ├── keras_resnet50.ipynb │ │ │ ├── optimize_for_inference.py │ │ │ ├── pb2sm_compile.py │ │ │ └── run_all │ │ ├── openpose_demo/ │ │ │ └── openpose.ipynb │ │ ├── ssd300_demo/ │ │ │ ├── README.md │ │ │ ├── ssd300_detection.py │ │ │ ├── ssd300_evaluation.py │ │ │ ├── ssd300_evaluation_client.py │ │ │ └── ssd300_model.py │ │ ├── tensorflow-neuronx/ │ │ │ └── tfneuronx-roberta-base-tutorial.ipynb │ │ ├── tensorflow_resnet50/ │ │ │ └── resnet50.ipynb │ │ ├── tensorflow_serving_tutorial.rst │ │ ├── yolo_v3_demo/ │ │ │ ├── yolo_v3.ipynb │ │ │ └── yolo_v3_coco_saved_model.py │ │ └── yolo_v4_demo/ │ │ ├── README.md │ │ ├── evaluate.ipynb │ │ └── yolo_v4_coco_saved_model.py │ ├── helperscripts/ │ │ ├── installationScripts/ │ │ │ └── python_instructions.txt │ │ ├── n2-helper.py │ │ ├── n2-manifest.json │ │ ├── neuron-releases-manifest.json │ │ ├── neuron-setup-example.py │ │ ├── neuronsetuphelper.py │ │ └── release-manifest-def.py │ ├── k8/ │ │ ├── bert_service.yml │ │ ├── k8s-neuron-device-plugin-rbac.yml │ │ ├── k8s-neuron-device-plugin.yml │ │ ├── k8s-neuron-monitor-daemonset.yml │ │ ├── k8s-neuron-scheduler-configmap.yml │ │ ├── k8s-neuron-scheduler-eks.yml │ │ ├── k8s-neuron-scheduler.yml │ │ ├── k8s-ultraserver-init-script.sh │ │ ├── my-scheduler.yml │ │ └── neuron-problem-detector/ │ │ ├── k8s-neuron-problem-detector-and-recovery-config.yml │ │ ├── k8s-neuron-problem-detector-and-recovery-rbac.yml │ │ └── k8s-neuron-problem-detector-and-recovery.yml │ ├── libnrt/ │ │ ├── README.md │ │ └── include/ │ │ ├── ndl/ │ │ │ ├── ndl.h │ │ │ ├── neuron_driver_shared.h │ │ │ └── neuron_driver_shared_tensor_batch_op.h │ │ └── nrt/ │ │ ├── ndebug_stream.h │ │ ├── nds/ │ │ │ └── neuron_ds.h │ │ ├── nec.h │ │ ├── nrt.h │ │ 
├── nrt_async.h │ │ ├── nrt_async_sendrecv.h │ │ ├── nrt_experimental.h │ │ ├── nrt_profile.h │ │ ├── nrt_status.h │ │ ├── nrt_sys_trace.h │ │ └── nrt_version.h │ ├── neuron-gatherinfo/ │ │ ├── LICENSE │ │ ├── clear_params_tfpb.py │ │ ├── mx_neuron_check_model.py │ │ ├── neuron-gatherinfo.py │ │ └── tf_neuron_check_model.py │ └── neuronperf/ │ ├── LICENSE │ ├── README.md │ ├── build.sh │ ├── conf.py │ ├── model_neuron_b1.csv │ ├── pyproject.toml │ ├── src/ │ │ └── neuronperf/ │ │ ├── __init__.py │ │ ├── __version__.py │ │ ├── benchmarking.py │ │ ├── compile_constants.py │ │ ├── cpu/ │ │ │ ├── __init__.py │ │ │ └── cpu.py │ │ ├── logging.py │ │ ├── model_index.py │ │ ├── mxnet/ │ │ │ ├── __init__.py │ │ │ └── mxnet.py │ │ ├── py.typed │ │ ├── reporting.py │ │ ├── scripts/ │ │ │ ├── __init__.py │ │ │ └── run_benchmark_file.py │ │ ├── tensorflow/ │ │ │ ├── __init__.py │ │ │ └── tensorflow.py │ │ ├── timing.py │ │ └── torch/ │ │ ├── __init__.py │ │ └── torch.py │ └── test/ │ └── test_neuronperf.py ├── static/ │ ├── google673a8c4fbaa024d8.html │ ├── robots.txt │ └── sitemap1.xml └── tools/ ├── index.rst ├── neuron-explorer/ │ ├── get-started.rst │ ├── how-to-link-view-source-code.rst │ ├── how-to-profile-workload.rst │ ├── index.rst │ ├── migration-faq.rst │ ├── overview-ai-recommendations.rst │ ├── overview-database-viewer.rst │ ├── overview-device-profiles.rst │ ├── overview-hierarchy-view.rst │ ├── overview-memory-viewer.rst │ ├── overview-summary-page.rst │ ├── overview-system-profiles.rst │ ├── overview-tensor-viewer.rst │ └── view-perfetto.rst ├── neuron-sys-tools/ │ ├── index.rst │ ├── nccom-test.rst │ ├── neuron-ls.rst │ ├── neuron-monitor-user-guide.rst │ ├── neuron-sysfs-user-guide.rst │ └── neuron-top-user-guide.rst ├── profiler/ │ ├── neuron-profile-user-guide.rst │ └── neuron-profiler-2-0-beta-user-guide.rst ├── tensorboard/ │ ├── getting-started-tensorboard-neuronx-plugin.rst │ └── index.rst ├── third-party-solutions.rst └── tutorials/ ├── index.rst ├── performance-profiling-vllm.rst ├── torch-neuronx-profiling-with-tb.rst ├── tutorial-neuron-monitor-mnist.rst └── tutorial-tensorboard-scalars-mnist.rst ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug-report.yml ================================================ --- name: "🐛 Bug Report" description: Report a bug title: "(short issue description)" labels: [bug, needs-triage] assignees: [] body: - type: textarea id: description attributes: label: Describe the bug description: What is the problem? Provide a clear description of your issue and the steps you took that produced it. validations: required: true - type: textarea id: modelname attributes: label: Model Name description: Provide Model Name validations: required: true - type: textarea id: workloadtype attributes: label: Describe the workload type description: Note the type of workload (such as Inference or Training) and any specific details about the workload configuration. validations: required: true - type: textarea id: instancetype attributes: label: Instance Type description: | Provide the AWS EC2 instance type you used to run the workload (such as `inf2.xlarge`, `trn1.32xlarge`, `trn2.48xlarge` etc.) validations: required: true - type: textarea id: release attributes: label: Release version description: | Provide the Neuron SDK release version (such as `2.25.0`) you are using, and all relevant Neuron component versions. 
``` apt list --installed | grep -i -e neuron pip list | grep -i -e neuron -e torch -e transformers -e jax ``` - type: textarea id: reproduction attributes: label: Reproduction Steps description: | Provide the type of the model and links to any tutorials you may have used, as additional context. Provide a self-contained, concise snippet of code that can be used to reproduce the issue. For more complex issues, provide a repo with the smallest sample that reproduces the bug. Avoid including business logic or unrelated code as it makes diagnosis more difficult. The code sample should be an SSCCE. See http://sscce.org/ for details. In short, please provide a code sample that we can copy/paste, run, and reproduce. validations: required: true - type: checkboxes id: regression attributes: label: Regression Issue description: Is this a regression (did it work in a previous version but not now)? If this is a regression, provide the Neuron SDK release version where this configuration worked for you. options: - label: Select this option if this issue appears to be a regression. required: false - type: textarea id: solution attributes: label: Possible Solution description: | Suggest a fix or reason for the bug, if you know one. validations: required: false - type: textarea id: context attributes: label: Logs/Context/Additional Information description: | Anything else that might be relevant for troubleshooting this bug. Providing context helps us come up with a solution that is most useful in the real world. When applicable, please provide HLOs and compiler commands. validations: required: false ================================================ FILE: .github/ISSUE_TEMPLATE/config.yml ================================================ blank_issues_enabled: false ================================================ FILE: .github/ISSUE_TEMPLATE/documentation.yml ================================================ --- name: "📕 Documentation Issue" description: Report an issue in the documentation and Developer Guide title: "(short issue description)" labels: [documentation, needs-triage] assignees: [] body: - type: textarea id: description attributes: label: Describe the issue description: A clear and concise description of the issue. validations: required: true - type: textarea id: links attributes: label: Links description: | Include links to affected documentation page(s). validations: required: true ================================================ FILE: .github/ISSUE_TEMPLATE/feature-request.yml ================================================ --- name: 🚀 Feature Request description: Suggest an idea for this project title: "(short issue description)" labels: [feature-request, needs-triage] assignees: [] body: - type: textarea id: description attributes: label: Describe the feature description: A clear and concise description of the feature you are proposing. validations: required: true - type: textarea id: use-case attributes: label: Use Case description: | Why do you need this feature? validations: required: true - type: textarea id: solution attributes: label: Proposed Solution description: | Provide detailed suggestions or requirements for this proposed feature. If you have them, include any reference implementation details (or even links to prototypes). validations: required: false - type: textarea id: other attributes: label: Other Information description: | Any additional details or information you can provide, including links to related content or similar issues.
validations: required: false - type: checkboxes id: ack attributes: label: Acknowledgements options: - label: I may be able to implement this feature request required: false ================================================ FILE: .github/pull_request_template.md ================================================ **IMPORTANT!** _If this is a documentation PR for a specific release, this PR must go to the corresponding release branch_ (`release-X.XX.X`). _If it is an "out-of-band" doc update, the PR must go to the_ `master` _branch_. ## Required PR information To expedite approvals and merges for releases, provide the following information (select the `...` button to the right at the top of your PR message to edit it): > **AWS email alias**: {_your-name_}@amazon.com > **Description**: {_What this documentation change is and why you made it. If you have a corresponding Jira ticket or content plan, link it here. The more details you provide around any decisions you made when preparing the docs, the fewer annoying comments you'll get preparing to release it._} > **Date this must be published by**: {_If empty, we will assume the release date for the branch you're merging into._} > **Link to ReadTheDocs staging for this branch's doc changes**: https://awsdocs-neuron-staging.readthedocs-hosted.com/en/{YOUR_BRANCH_NAME_HERE}/ > **Set the `docs-review-needed` label on the PR for tracking.** ## Before you request approvals > Run a spelling and grammar check over your prose and make the changes it suggests. VSCode has a number of extensions (cSpell, LTeX) that you can use. You can also provide the rendered HTML for (or a cut-and-paste of) your pages to an AI and have it correct your spelling, grammar, and formatting issues. If you need an advanced prompt, contact @erickson-doug. ## Approvers We require 3-4 approvers to merge for non-trivial content changes (where a "trivial" change is a typo/grammar fix or a minor update to the format syntax): 1. A senior+ engineer who will review your documentation for technical accuracy and clarity in communicating the technical concepts in your work 2. A product manager for your Neuron component area who will review it for customer relevance and product/component/feature messaging 3. The lead tech writer (@erickson-doug) who will review your work for overall doc design and quality, and perform the merge when all approvals are met 4. (For PRs with code/notebook submissions) A QA/test engineer who can run your code and confirm the results. Make sure you get a commitment from these reviewers in advance! It's hard to line up good-quality doc reviews in the 11th hour of a release. **Note**: For trivial changes, you only need @erickson-doug's approval. He will merge your content once he's confirmed the fixes on staging. ## Doc review checklist ### Engineering reviewer checklist - [ ] I've confirmed that the contributions in this PR meet the current [AWS Neuron writing guidelines](https://quip-amazon.com/m97CAO0kQFEU/Writing-for-AWS-Neuron). - [ ] I've confirmed that the documentation submitted is technically correct to the best of my knowledge. - [ ] I've confirmed that the documentation submitted has no spelling or grammar errors or use of internal jargon/terminology. - [ ] I've verified the changes render correctly on RTD (link above). - [ ] (If code is included) I've run tests to verify the contents of the change.
--- ## For PRs that include code or notebook examples **MANDATORY: PR must include test run output** Provide this information for the QA reviewer in order to expedite their review. **Test run output:** Specify the release version, instance size and type, OS type and test output. **For Training tutorials:** {Convergence graph for training tutorials} {Performance metrics `average_throughput`, `latency_p50`, `latency_p99` and MFU% if available} Make sure this PR contains correct classification terms (Alpha, Beta, and Stable). If possible, provide your results or a link to them for the reviewer to check your work. ## Code example/notebook content PR checklist - [ ] (If applicable) I've automated a test to safeguard my changes from regression. - [ ] (If applicable) I've posted test collateral to prove my change was effective and not harmful. - [ ] (If applicable) I've added someone from QA to the list of reviewers. Do this if you didn't make an automated test or feel it's appropriate for another reason. - [ ] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the pre-approved Amazon license list. See https://inside.amazon.com/en/services/legal/us/OpenSource/Pages/BlessedOpenSourceLicenses.aspx. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. ================================================ FILE: .github/stale_issue_mark_close_workflow.yml ================================================ name: Close inactive issues on: schedule: - cron: "30 1 * * *" jobs: close-issues: runs-on: ubuntu-latest permissions: issues: write pull-requests: write steps: - uses: actions/stale@v5 with: days-before-issue-stale: 30 days-before-issue-close: 14 stale-issue-label: "stale" stale-issue-message: "This issue is stale because it has been open for 30 days with no activity." close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale." days-before-pr-stale: -1 days-before-pr-close: -1 repo-token: ${{ secrets.GITHUB_TOKEN }} ================================================ FILE: .github/workflows/acknowledge-new-issue.yml ================================================ name: Acknowledge New Issue on: issues: types: [opened] permissions: issues: write jobs: acknowledge: runs-on: ubuntu-latest steps: - name: Comment on issue uses: actions/github-script@v7 with: script: | const creator = context.payload.issue.user.login; await github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.payload.issue.number, body: `Hi @${creator}, Thank you for filing the issue! 
We will take a look and get back to you.` }); ================================================ FILE: .github/workflows/auto-label-issues.yml ================================================ # Auto-label issues based on content keywords name: auto-label-issues on: issues: types: [opened] jobs: auto-label-issues: runs-on: ubuntu-latest permissions: issues: write steps: - name: Analyze issue content id: analyze_content uses: actions/github-script@v7 env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} ISSUE_TITLE: ${{ github.event.issue.title }} ISSUE_BODY: ${{ github.event.issue.body }} with: script: | const title = process.env.ISSUE_TITLE || ''; const body = process.env.ISSUE_BODY || ''; const content = `${title} ${body}`; const labels = []; // ============================================================================= // LABEL CONFIGURATION - Easy to update dictionary // Add keywords, typos, or synonyms to the arrays below // ============================================================================= const labelConfig = { // ----- Issue Type Labels (mutually exclusive) ----- bug: { keywords: [ // Standard terms 'bug', 'error', 'crash', 'fail', 'failed', 'failure', 'failing', 'broken', 'exception', 'traceback', 'segfault', 'segmentation fault', // Synonyms 'issue', 'problem', 'defect', 'fault', 'glitch', 'malfunction', 'wrong', 'incorrect', 'unexpected', 'hang', 'hanging', 'hung', 'freeze', 'frozen', 'timeout', 'timed out', 'oom', 'out of memory', 'memory error', 'nan', 'diverge', 'diverged', // Common typos 'bugg', 'bgu', 'eror', 'errror', 'crahs', 'fial', 'brokn', 'broke' ], patterns: [/not\s*work/i, /doesn'?t\s*work/i, /won'?t\s*work/i, /can'?t\s*work/i] }, documentation: { keywords: [ // Standard terms 'doc', 'docs', 'documentation', 'readme', 'guide', 'tutorial', 'howto', 'how-to', 'how to', 'typo', 'typos', 'spelling', 'grammar', 'example', 'examples', 'sample', 'samples', 'instruction', 'instructions', 'clarify', 'clarification', 'unclear', 'confusing', 'outdated', 'out of date', 'stale', 'missing documentation', 'missing docs', 'broken link', 'dead link', '404', // Common typos 'documention', 'documenation', 'documentaion', 'tutoral', 'toturial' ], patterns: [/issue\s*on\s*page/i, /page\s*.*\.html/i] }, 'feature-request': { keywords: [ // Standard terms 'feature', 'feature request', 'feature-request', 'enhancement', 'improvement', 'implement', 'implementation', 'new feature', 'add feature', 'support for', 'add support', 'would be nice', 'would be great', 'would be helpful', 'suggestion', 'suggest', 'proposal', 'propose', 'wishlist', 'wish list', // Common typos 'feture', 'featrue', 'enchancement', 'improvment' ], patterns: [/add\s+support\s+for/i, /please\s+add/i, /would\s+be\s+(nice|great|helpful)/i] }, // ----- Hardware Labels (independent - multiple can be applied) ----- Trn1: { keywords: [ 'trn1', 'trn-1', 'trn 1', 'trn1n', 'trn1.2xlarge', 'trn1.32xlarge', 'trn1n.32xlarge', 'trainium', 'trainium1', 'trainium 1', 'trainium-1', // Common typos 'tranium', 'trainuim', 'trn-1n' ], patterns: [/trn1n?(?:\.[0-9]*xlarge)?/i, /trainium\s*1?(?!\s*2)/i] }, Trn2: { keywords: [ 'trn2', 'trn-2', 'trn 2', 'trn2.48xlarge', 'trainium2', 'trainium 2', 'trainium-2', // Common typos 'tranium2', 'trainuim2' ], patterns: [/trn2(?:\.[0-9]*xlarge)?/i, /trainium\s*2/i] }, Inf1: { keywords: [ 'inf1', 'inf-1', 'inf 1', 'inf1.xlarge', 'inf1.2xlarge', 'inf1.6xlarge', 'inf1.24xlarge', 'inferentia', 'inferentia1', 'inferentia 1', 'inferentia-1', // Common typos 'infertia', 'inferntia', 'infernita' ], patterns: 
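// Note: the Inf1 patterns below pair an instance-size match with a negative lookahead, (?!\s*2), so a bare "inferentia" is labeled Inf1 without also firing on "inferentia 2" (which the Inf2 config handles).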
[/inf1(?:\.[0-9]*xlarge)?/i, /inferentia\s*1?(?!\s*2)/i] }, Inf2: { keywords: [ 'inf2', 'inf-2', 'inf 2', 'inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge', 'inferentia2', 'inferentia 2', 'inferentia-2', // Common typos 'infertia2', 'inferntia2', 'infernita2' ], patterns: [/inf2(?:\.[0-9]*xlarge)?/i, /inferentia\s*2/i] }, // ----- Use Case Labels (independent - both can be applied) ----- Inference: { keywords: [ // Standard terms 'inference', 'inferencing', 'predict', 'prediction', 'predictions', 'predicting', 'serving', 'serve', 'server', 'batch inference', 'real-time', 'realtime', 'endpoint', 'endpoints', // Common typos 'infernce', 'inferance', 'prediciton', 'deploymnet' ], patterns: [/infer(?:ence|ring)?/i, /predict(?:ion|ing)?/i, /deploy(?:ment|ing)?/i] }, Training: { keywords: [ // Standard terms 'training', 'train', 'trained', 'fine-tune', 'finetune', 'fine tune', 'finetuning', 'fine-tuning', 'pretrain', 'pre-train', 'pretraining', 'pre-training', 'learning', 'learn', 'gradient', 'gradients', 'backward', 'backprop', 'backpropagation', 'loss', 'convergence', 'converge', 'epoch', 'epochs', 'checkpoint', 'checkpointing', // Common typos 'trainig', 'traning', 'trainin', 'fintune', 'finetunning' ], patterns: [/train(?:ing|ed)?/i, /fine[\s-]?tun(?:e|ing)/i, /pre[\s-]?train(?:ing)?/i] } }; // ============================================================================= // MATCHING LOGIC // ============================================================================= function matchesLabel(config) { const contentLower = content.toLowerCase(); // Check keywords (case-insensitive substring match) for (const keyword of config.keywords) { if (contentLower.includes(keyword.toLowerCase())) { return true; } } // Check regex patterns for (const pattern of config.patterns) { if (pattern.test(content)) { return true; } } return false; } // Issue Type Labels - MUTUALLY EXCLUSIVE (priority: bug > documentation > feature-request) if (matchesLabel(labelConfig.bug)) { labels.push('bug'); } else if (matchesLabel(labelConfig.documentation)) { labels.push('documentation'); } else if (matchesLabel(labelConfig['feature-request'])) { labels.push('feature-request'); } // Hardware/Instance Type Labels - INDEPENDENT (multiple can be applied) if (matchesLabel(labelConfig.Trn1)) { labels.push('Trn1'); } if (matchesLabel(labelConfig.Trn2)) { labels.push('Trn2'); } if (matchesLabel(labelConfig.Inf1)) { labels.push('Inf1'); } if (matchesLabel(labelConfig.Inf2)) { labels.push('Inf2'); } // Use Case Labels - INDEPENDENT (both can be applied) if (matchesLabel(labelConfig.Inference)) { labels.push('Inference'); } if (matchesLabel(labelConfig.Training)) { labels.push('Training'); } core.setOutput('labels', labels.join(',')); core.setOutput('has_labels', labels.length > 0); - name: Apply labels to issue if: steps.analyze_content.outputs.has_labels == 'true' env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} run: | IFS=',' read -ra LABELS <<< "${{ steps.analyze_content.outputs.labels }}" for label in "${LABELS[@]}"; do gh issue edit ${{ github.event.issue.number }} --add-label "$label" -R ${{ github.repository }} done ================================================ FILE: .gitignore ================================================ _build/ __pycache__/ .venv/ .DS_Store src/examples/pytorch/libtorch_demo.tar.gz src/neuronperf.tar.gz *-checkpoint.ipynb .idea/ .vscode/ nki/*/generated/ uncommitted/ ================================================ FILE: .readthedocs.yml ================================================ # 
.readthedocs.yml # Read the Docs configuration file # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details # Required version: 2 # Set the version of Python and other tools you might need build: os: "ubuntu-22.04" tools: python: "3.10" # jobs: # pre_build: # - python -m sphinx -b linkcheck . _build/linkcheck # Build documentation in the docs/ directory with Sphinx sphinx: configuration: conf.py #conda #conda: # file: readthedocs-environment.yml # Build documentation with MkDocs #mkdocs: # configuration: mkdocs.yml # Optionally build your docs in additional formats such as PDF #formats: # - pdf # Optionally set the version of Python and requirements required to build your docs python: install: - requirements: requirements.txt ================================================ FILE: CODEOWNERS ================================================ # This file creates codeowners for the documentation. It will allow setting code reviewers for all Pull requests to merge to the master branch # Each line is a file pattern followed by one or more owners. # Reference guide - https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-code-owners#example-[…]ners-file # Example - These owners will be the default owners for everything in # the repo. Unless a later match takes precedence, # @global-owner1 and @global-owner2 will be requested for # review when someone opens a pull request. # * @global-owner1 @global-owner2 * @aws-maens @micwade-aws @musunita @aws-sadaf @rgrandhiamzn @eshalakhotia @jluntamazon @jeffhataws @aws-rhsoln @hannanjgaws @PrashantSaraf @aws-donkrets @aws-singhada @gsnaws @awsjoshir @sidjoshiaws @pinak-p @vikas-paliwal-aws @aarondou @mrinalks @erickson-doug @lnixaws @micwade-aws src/examples/mxnet/ @aws-rhsoln @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/mxnet-neuron/ @aws-rhsoln @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/ @musunita @aws-rhsoln @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia src/examples/tensorflow/ @awshaichen @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/tensorflow-neuron/ @awshaichen @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/ @musunita @awshaichen @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia src/examples/pytorch/ @jluntamazon @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/pytorch-neuron/ @jluntamazon @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia neuron-guide/neuron-frameworks/pytorch-neuron/tutorials/ @musunita @jluntamazon @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia libraries/nxd-inference/ @huntingcarlisle @lccasagrande @lipovsek-aws @erickson-doug @eshalakhotia @pinak-p @hannanjgaws @akhil-aws @ahimsh-aws @rgrandhiamzn @yahavb @FThompsonAWS @gsnaws @sidjoshiaws @jluntamazon @musunita ================================================ FILE: CONTRIBUTING.md ================================================ # Contributing Guidelines Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community. 
Please read through this document before submitting any issues or pull requests to ensure we have all the necessary information to effectively respond to your bug report or contribution. ## Reporting Bugs/Feature Requests We welcome you to use the GitHub issue tracker to report bugs or suggest features. When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: * A reproducible test case or series of steps * The version of our code being used * Any modifications you've made relevant to the bug * Anything unusual about your environment or deployment ## Contributing Workflow (via Pull Requests) Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 1. You are working against the latest source on the *master* branch. 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. **Important**: Currently, local doc builds require a Python 3.9 environment. If you are on macOS, you can install it from the terminal with `brew install python@3.9`. Add it to your working path with `brew link python@3.9` and confirm it works by running `python3.9 --version`. ### Docker Build If you don't have Python 3.9/3.10 or a compatible gcc toolchain, use the Docker workflow: ```bash ./build.sh build # Build Docker image (first time only) ./build.sh html # Build HTML docs to _build/html/ ./build.sh shell # Interactive shell for debugging ./build.sh clean # Remove _build/ directory ``` ### Manual Build To send us a pull request, please: 1. Clone the repository locally: ```bash git clone git@github.com:YOUR-USERNAME/private-aws-neuron-sdk-staging.git ``` 2. Install the build dependencies. This requires a Python 3.9 installation and venv: ```bash cd .. # The root folder where you have your cloned Git repos; don't run this in the repo folder but one level up or you'll have venv files in your repo folder python3.9 -m venv venv && . venv/bin/activate pip install -U pip cd private-aws-neuron-sdk-staging pip install -r requirements.txt ``` 3. Build the documentation into HTML. This command will allow you to view the rendered documentation by opening the generated `_build/html/index.html`. On first run, this will take about 15 minutes. Subsequent HTML generations are incremental and will take less time. Run: ```bash sphinx-build -b html . _build/html ``` Or leverage the Makefile and run: ```bash make html ``` If this doesn't work, try this command: ```bash sphinx-build -C -b html . _build/html ``` For speedier builds in multiprocessor environments, run: ```bash sphinx-build -b html . _build/html -j auto ``` **NOTE**: If you get an error for the spelling extension, like `Extension error: Could not import extension sphinxcontrib.spelling (exception: The 'enchant' C library was not found and maybe needs to be installed. See https://pyenchant.github.io/pyenchant/install.html)`, run `brew install enchant`. 4. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 5. Rebuild the documentation with `sphinx-build -b html . _build/html`.
Always ensure that the docs build without errors and that your changes look correct before pushing them to remote. * If you encounter errors that are unclear, run the build in verbose mode with `sphinx-build -vv -b html . _build/html`. 6. Commit your changes to your branch with a clear, scoped commit message. Bad: "fixed stuff". Good: "Updated ref IDs in all containers topics". 7. Push your changes to remote (`git push origin`) and create a PR from your branch into `master` or the standing release branch (example: `release-2.27.0`). Answer any default questions in the pull request interface. * See: [pull request guide](https://help.github.com/articles/creating-a-pull-request/). 8. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. Updated process documentation can be found here: [Runbook: Authoring a topic for the Neuron documentation](https://quip-amazon.com/e9B9AM7Npb17/Runbook-Authoring-a-topic-for-the-Neuron-documentation). ## Updating the sitemap If you add or remove a topic, you must recreate the sitemap. To do so: 1. From a shell, `cd` to the root of this repo (`private-aws-neuron-sdk-staging`) on your local machine. 2. Run the following command: `python3 ./_utilities/create_sitemap.py`. This will generate the sitemap as `sitemap.xml` in the root folder of the repo. 3. Rename the `sitemap.xml` file to `sitemap1.xml`. 4. Move the `sitemap1.xml` file to the `/static` folder, copying over the previous version. 5. Delete the generated `sitemap.xml` file from the root (**not** from `/static`) if you did a copy instead of a move. 6. Push a PR with the updated sitemap to remote and request that DougEric review/approve it. ## Finding contributions to work on Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. * Or, if you're so inclined, get on DougEric's Christmas card list by fixing broken links, formatting errors, removing stale topics, and fixing spelling/grammar errors. ## Code of Conduct This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For more information, see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact opensource-codeofconduct@amazon.com with any additional questions or comments. ## Security issue notifications If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. ## Licensing See the [LICENSE-DOCUMENTATION](./LICENSE-DOCUMENTATION), [LICENSE-SAMPLECODE](./LICENSE-SAMPLECODE) and [LICENSE-SUMMARY-DOCS-SAMPLES](./LICENSE-SUMMARY-DOCS-SAMPLES) files for our project's licensing. We will ask you to confirm the licensing of your contribution.
We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. ================================================ FILE: Dockerfile ================================================ FROM python:3.10-slim RUN apt-get update && apt-get install -y --no-install-recommends \ make enchant-2 git pandoc \ && rm -rf /var/lib/apt/lists/* \ && pandoc --version COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv WORKDIR /docs COPY requirements.txt . RUN uv pip install --system -r requirements.txt --extra-index-url=https://pypi.org/simple ENTRYPOINT ["/bin/bash"] ================================================ FILE: LICENSE-DOCUMENTATION ================================================ *** Documentation: Creative Commons Attribution-ShareAlike 4.0 International Public License By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. Section 1 – Definitions. a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. c. BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License. d. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. e. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. f. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. g. License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike. h. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License. i.
Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. j. Licensor means the individual(s) or entity(ies) granting rights under this Public License. k. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. l. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. m. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. Section 2 – Scope. a. License grant. 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: A. reproduce and Share the Licensed Material, in whole or in part; and B. produce, reproduce, and Share Adapted Material. 2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 3. Term. The term of this Public License is specified in Section 6(a). 4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 5. Downstream recipients. A. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. B. Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply. C. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 6. No endorsement. 
Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). b. Other rights. 1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 2. Patent and trademark rights are not licensed under this Public License. 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. Section 3 – License Conditions. Your exercise of the Licensed Rights is expressly made subject to the following conditions. a. Attribution. 1. If You Share the Licensed Material (including in modified form), You must: A. retain the following if it is supplied by the Licensor with the Licensed Material: i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); ii. a copyright notice; iii. a notice that refers to this Public License; iv. a notice that refers to the disclaimer of warranties; v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable; B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. b. ShareAlike. In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply. 1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License. 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material. 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply. Section 4 – Sui Generis Database Rights. Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: a.
for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. Section 5 – Disclaimer of Warranties and Limitation of Liability. a. Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You. b. To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You. c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. Section 6 – Term and Termination. a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 2. upon express reinstatement by the Licensor. c. For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. d. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. e. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. Section 7 – Other Terms and Conditions. a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. b. 
Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. Section 8 – Interpretation. a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. ================================================ FILE: LICENSE-SAMPLECODE ================================================ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: LICENSE-SUMMARY-DOCS-SAMPLES ================================================ *** Documentation and Sample Code: Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file. The sample code within this documentation is made available under the MIT-0 license. See the LICENSE-SAMPLECODE file. ================================================ FILE: Makefile ================================================ # Minimal makefile for Sphinx documentation # # You can set these variables from the command line, and also # from the environment for the first two. SPHINXOPTS ?= SPHINXBUILD ?= sphinx-build SOURCEDIR = $(CURDIR) BUILDDIR = _build # Put it first so that "make" without argument is like "make help". help: @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) .PHONY: help Makefile clean # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
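# For example, `make html` runs `sphinx-build -M html` against this source tree, and any other Sphinx builder name (such as `make linkcheck`) routes through the same rule.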
%: Makefile @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) clean: -rm -rf $(BUILDDIR)/* ================================================ FILE: README.md ================================================ ![neuron](./images/Site-Merch_Neuron-ML-SDK_Editorial.png) # AWS Neuron ## Neuron SDK Overview AWS Neuron is a software development kit (SDK) enabling high-performance deep learning acceleration using AWS Inferentia and Trainium, AWS's custom-designed machine learning accelerators. With Neuron, you can develop, profile, and deploy high-performance machine learning workloads on top of accelerated EC2 instances, such as Inf1 and Trn1. Neuron includes a compiler, a runtime driver, and debug and profiling utilities with a TensorBoard plugin for visualization, and is pre-integrated into popular machine learning frameworks like PyTorch, TensorFlow, and MXNet to provide a seamless machine learning acceleration workflow. ## Neuron SDK’s documentation For full documentation, including user guides, how-tos, and tutorials, see [Neuron SDK’s documentation](https://awsdocs-neuron.readthedocs-hosted.com/) ## Support If none of the GitHub and online resources have an answer to your question, check out the AWS Neuron [support forum](https://forums.aws.amazon.com/forum.jspa?forumID=355). ================================================ FILE: _backup-setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.rst ================================================ .. _setup-ubuntu22-multi-framework-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI ====================================================================== You can quickly get started on Ubuntu 22 using the Neuron Deep Learning AMI (DLAMI). Then, start using one of the multiple frameworks or libraries that the Neuron SDK supports by activating the corresponding virtual environment. Each virtual environment comes pre-installed with the Neuron libraries you need to get started. The Neuron DLAMI supports all Neuron instances (Inf1/Inf2/Trn1/Trn1n/Trn2/Trn3) and is updated with each Neuron SDK release. To start using the latest version of the Neuron DLAMI, use the following steps: Step 1: Launch the instance using Neuron DLAMI ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you open the `EC2 Console `_, select your desired AWS region and choose "Launch Instance". Under AMI selection, select "Quick Start" and "Ubuntu", then choose the "Deep Learning AMI Neuron (Ubuntu 22.04)" (see the screenshot below). Once you have selected the AMI, select the desired Neuron instance type (Inf1/Inf2/Trn1/Trn1n/Trn2/Trn3), configure the disk size and other criteria, and launch the instance. .. image:: /images/neuron-multi-framework-dlami-quick-start.png :scale: 20% :align: center .. note:: If you are looking to use the Neuron DLAMI in your cloud automation flows, Neuron also supports :ref:`SSM parameters ` to easily retrieve the latest DLAMI ID. Step 2: Activate the desired virtual environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can activate one of the virtual environments depending on the library or framework you are interested in: 1. Get the desired virtual environment name for the framework/library by referring to :ref:`the Neuron DLAMI overview `. 2.
Activate the virtual environment by using: :: source /opt/<virtual-environment-name>/bin/activate After you have activated the desired virtual environment, you can try out one of the tutorials listed in the corresponding framework or library training and inference section. ================================================ FILE: _backup-setup/neuron-setup/multiframework/multi-framework-ubuntu24-neuron-dlami.rst ================================================ .. _setup-ubuntu24-multi-framework-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small Get Started with Neuron on Ubuntu 24 with Neuron Multi-Framework DLAMI ====================================================================== You can quickly get started on Ubuntu 24 using the Neuron Deep Learning AMI (DLAMI). Then, start using one of the multiple frameworks or libraries that the Neuron SDK supports by activating the corresponding virtual environment. Each virtual environment comes pre-installed with the Neuron libraries you need to get started. The Neuron DLAMI supports all Neuron instances (Inf2/Trn1/Trn1n/Trn2/Trn3) and is updated with each Neuron SDK release. To start using the latest version of the Neuron DLAMI, use the following steps: Step 1: Launch the instance using Neuron DLAMI ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you open the `EC2 Console `_, select your desired AWS region and choose "Launch Instance". Under AMI selection, select "Quick Start" and "Ubuntu", then choose the "Deep Learning AMI Neuron (Ubuntu 24.04)" (see the screenshot below). Once you have selected the AMI, select the desired Neuron instance type (Inf2/Trn1/Trn1n/Trn2/Trn3), configure the disk size and other criteria, and launch the instance. .. image:: /images/neuron-multi-framework-dlami-U24-quick-start.png :scale: 20% :align: center .. note:: If you are looking to use the Neuron DLAMI in your cloud automation flows, Neuron also supports :ref:`SSM parameters ` to easily retrieve the latest DLAMI ID. Step 2: Activate the desired virtual environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can activate one of the virtual environments depending on the library or framework you are interested in: 1. Get the desired virtual environment name for the framework/library by referring to :ref:`the Neuron DLAMI overview `. 2. Activate the virtual environment by using: :: source /opt/<virtual-environment-name>/bin/activate After you have activated the desired virtual environment, you can try out one of the tutorials listed in the corresponding framework or library training and inference section. ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/amazon-linux/torch-neuron-al2-base-dlami.rst ================================================ .. _setup-torch-neuron-al2-base-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Amazon Linux 2 with DLAMI Base ======================================================================= .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2. Upgrade to AL2023 following the :ref:`AL2 Migration guide ` .. contents:: Table of contents :local: :depth: 2 ..
include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly start with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Amazon Linux 2) " from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see a matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools .. include:: /includes/setup/tab-inference-torch-neuron-al2.txt .. include :: /archive/torch-neuron/setup/pytorch-update-al2.rst .. include :: /archive/torch-neuron/setup/pytorch-install-prev-al2.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/amazon-linux/torch-neuron-al2-pytorch-dlami.rst ================================================ .. _setup-torch-neuron-al2-pytorch-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Amazon Linux 2 with PyTorch DLAMI ========================================================================= .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2. Upgrade to AL2023 following the :ref:`AL2 Migration guide ` .. contents:: Table of contents :local: :depth: 2 .. include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly start with a fresh installation of :ref:`setup-torch-neuron`. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
* To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Check for the latest version of the `DLAMI Neuron PyTorch 1.13 AMI `_ and copy the AMI name that starts with "Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) " from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see an exactly matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Update Neuron Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=driver_runtime_tools --framework=pytorch --framework-version=1.13.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 .. dropdown:: Get Started With PyTorch DLAMI :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 98 :end-line: 99 .. card:: Visit PyTorch Neuron (``torch-neuron``) for Inference section :link: inference-torch-neuron :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Visit PyTorch Neuron section for more :class-body: sphinx-design-class-body-small :link: neuron-pytorch :link-type: ref .. include:: /archive/torch-neuron/setup/pytorch-update-al2-dlami.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-al2.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/amazon-linux/torch-neuron-al2.rst ================================================ .. _setup-torch-neuron-al2: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Amazon Linux 2 ========================================================= .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2. Upgrade to AL2023 following the :ref:`AL2 Migration guide ` .. contents:: Table of contents :local: :depth: 2 .. include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly start with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Select Amazon Linux 2 AMI (HVM) - Kernel 5.10 * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in ..
.. include:: /includes/setup/tab-inference-torch-neuron-al2.txt .. include:: /archive/torch-neuron/setup/pytorch-update-al2.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-al2.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/amazon-linux/torch-neuron-al2023.rst ================================================ .. _setup-torch-neuron-al2023: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Amazon Linux 2023 =========================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Select Amazon Linux 2023 AMI * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools .. include:: /includes/setup/tab-inference-torch-neuron-al2023.txt .. include:: /archive/torch-neuron/setup/pytorch-update-al2023.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-al2023.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/ubuntu/torch-neuron-ubuntu20-base-dlami.rst ================================================ .. _setup-torch-neuron-u20-base-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Ubuntu 20 with DLAMI Base ================================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console,
please make sure to select the correct instance type. * To get more information about instance sizes and pricing see: `Inf1 web page `_ * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Ubuntu 20.04) " from "AMI Name:" section * Search for the copied AMI name in the AMI Search , you should see a matching AMI with the AMI name in Community AMIs. Select the AMI and use it to launch the instance. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools .. include:: /includes/setup/tab-inference-torch-neuron-u20.txt .. include:: /archive/torch-neuron/setup/pytorch-update-u20.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/ubuntu/torch-neuron-ubuntu20-pytorch-dlami.rst ================================================ .. _setup-torch-neuron-u20-pytorch-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Ubuntu 20 with Pytorch DLAMI ===================================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provide links that will assist you to quickly start with a fresh installation of :ref:`setup-torch-neuron`. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console. please make sure to select the correct instance type. * To get more information about instances sizes and pricing see: `Inf1 web page `_ * Check for the latest version of the `DLAMI Neuron Pytorch 1.13 AMI `_ and copy the AMI name that starts with "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) " from "AMI Name:" section * Search for the copied AMI name in the AMI Search , you should see an exact matching AMI with the AMI name in Community AMIs. Select the AMI and use it to launch the instance. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Update Neuron Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=driver_runtime_tools --framework=pytorch --framework-version=1.13.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 .. dropdown:: Get Started With Pytorch DLAMI :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 101 :end-line: 102 .. 
.. card:: PyTorch Neuron(``torch-neuron``) for Inference :link: inference-torch-neuron :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Visit PyTorch Neuron section for more :class-body: sphinx-design-class-body-small :link: neuron-pytorch :link-type: ref .. include:: /archive/torch-neuron/setup/pytorch-update-u20-dlami.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/ubuntu/torch-neuron-ubuntu20.rst ================================================ .. _setup-torch-neuron-u20: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Ubuntu 20 ==================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Select Ubuntu Server 20 AMI * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools .. include:: /includes/setup/tab-inference-torch-neuron-u20.txt .. include:: /archive/torch-neuron/setup/pytorch-update-u20.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuron/ubuntu/torch-neuron-ubuntu22.rst ================================================ .. _setup-torch-neuron-u22: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuron") Setup on Ubuntu 22 ===================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`setup-torch-neuron` for Inference. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
* To get more information about instance sizes and pricing, see: `Inf1 web page `_ * Select Ubuntu Server 22 AMI * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools .. include:: /includes/setup/tab-inference-torch-neuron-u22.txt .. include:: /archive/torch-neuron/setup/pytorch-update-u22.rst .. include:: /archive/torch-neuron/setup/pytorch-install-prev-u22.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/amazon-linux/torch-neuronx-al2-base-dlami.rst ================================================ .. _setup-torch-neuronx-al2-base-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Amazon Linux 2 with DLAMI Base ========================================================================= .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2. Upgrade to AL2023 following the :ref:`AL2 Migration guide `. .. contents:: Table of contents :local: :depth: 2 .. include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Amazon Linux 2)" from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see a matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance. * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB (see the launch-time sketch at the end of this page). * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 2 :end-line: 3 .. include:: /includes/setup/tab-inference-torch-neuronx-al2.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-al2.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2.rst
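As referenced in the launch steps above, you can size the root volume correctly at launch time instead of resizing it afterwards. The following is a minimal AWS CLI sketch; the AMI ID, subnet ID, and key name are hypothetical placeholders you must replace with your own values:

.. code-block:: bash

   # Launch a Trn1 instance with a 512 GB gp3 root volume.
   # The root device name depends on the AMI (/dev/xvda for Amazon Linux).
   aws ec2 run-instances \
       --image-id ami-0123456789abcdef0 \
       --instance-type trn1.32xlarge \
       --key-name my-key \
       --subnet-id subnet-0123456789abcdef0 \
       --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":512,"VolumeType":"gp3"}}]'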
================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/amazon-linux/torch-neuronx-al2-pytorch-dlami.rst ================================================ .. _setup-torch-neuronx-al2-dlami-pytorch: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Amazon Linux 2 with DLAMI PyTorch =========================================================================== .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2. Upgrade to AL2023 following the :ref:`AL2 Migration guide `. .. contents:: Table of contents :local: :depth: 2 .. include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Check for the latest version of the `DLAMI Neuron PyTorch 1.13 AMI `_ and copy the AMI name that starts with "Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)" from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see an exact match for the AMI name in Community AMIs. Select the AMI and use it to launch the instance. * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Update Neuron Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=driver_runtime_tools --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 .. dropdown:: Get Started With PyTorch DLAMI :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 50 :end-line: 51 .. card:: Visit PyTorch Neuron(``torch-neuronx``) for Inference section :link: inference-torch-neuronx :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Visit PyTorch Neuron(``torch-neuronx``) for Training section :link: training-torch-neuronx :link-type: ref :class-body: sphinx-design-class-title-small .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-al2-dlami.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/amazon-linux/torch-neuronx-al2.rst ================================================ .. _setup-torch-neuronx-al2: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Amazon Linux 2 ========================================================= .. note:: As of 2.20.0, Neuron Runtime no longer supports AL2.
Upgrade to AL2023 following the :ref:`AL2 Migration guide `. .. contents:: Table of contents :local: :depth: 2 .. include:: /setup/install-templates/al2-python.rst Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Select Amazon Linux 2 AMI (HVM) - Kernel 5.10 * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 2 :end-line: 3 .. include:: /includes/setup/tab-inference-torch-neuronx-al2.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-al2.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/amazon-linux/torch-neuronx-al2023.rst ================================================ .. _setup-torch-neuronx-al2023: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Amazon Linux 2023 ============================================================ .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Select Amazon Linux 2023 AMI * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 239 :end-line: 240 .. include:: /includes/setup/tab-inference-torch-neuronx-al2023.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-al2023.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2023.rst
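For orientation, the generated AL2023 instructions above boil down to registering the Neuron yum repository and installing the driver and tools with ``dnf``. A condensed sketch follows; treat the rendered instructions above as authoritative, since repository details can change between releases:

.. code-block:: bash

   # Register the Neuron repository (AL2023 uses dnf)
   sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
   [neuron]
   name=Neuron YUM Repository
   baseurl=https://yum.repos.neuron.amazonaws.com
   enabled=1
   EOF
   sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

   # Install the Neuron driver and tools
   sudo dnf install -y aws-neuronx-dkms aws-neuronx-tools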
================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-base-dlami.rst ================================================ .. _setup-torch-neuronx-ubuntu20-base-dlami: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 20 with DLAMI Base ==================================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Ubuntu 20.04)" from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see a matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance. * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 5 :end-line: 6 .. include:: /includes/setup/tab-inference-torch-neuronx-u20.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-u20.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.rst ================================================ .. _setup-torch-neuronx-ubuntu20-dlami-pytorch: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 20 with DLAMI PyTorch ====================================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance.
When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Check for the latest version of the `DLAMI Neuron PyTorch 1.13 AMI `_ and copy the AMI name that starts with "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04)" from the "AMI Name:" section * Search for the copied AMI name in the AMI search; you should see an exact match for the AMI name in Community AMIs. Select the AMI and use it to launch the instance. * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Update Neuron Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=driver_runtime_tools --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 .. dropdown:: Get Started With PyTorch DLAMI :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 53 :end-line: 54 .. card:: Visit PyTorch Neuron(``torch-neuronx``) for Inference section :link: inference-torch-neuronx :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Visit PyTorch Neuron(``torch-neuronx``) for Training section :link: training-torch-neuronx :link-type: ref :class-body: sphinx-design-class-title-small .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-u20-dlami.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20.rst ================================================ .. _setup-torch-neuronx-ubuntu20: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :width: 100% :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 20 =================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. include:: /setup/install-templates/trn1-ga-warning.txt .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Select Ubuntu Server 20 AMI * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 5 :end-line: 6
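On Ubuntu, the instructions rendered by the include above typically amount to registering the Neuron apt repository and installing the driver and tools. A condensed sketch follows; the rendered instructions remain the source of truth:

.. code-block:: bash

   # Register the Neuron apt repository for this Ubuntu release
   . /etc/os-release
   echo "deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main" | \
       sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null
   wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

   # Install the Neuron driver and tools
   sudo apt-get update -y
   sudo apt-get install -y aws-neuronx-dkms aws-neuronx-tools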
.. include:: /includes/setup/tab-inference-torch-neuronx-u20.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-u20.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u20.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.rst ================================================ .. _setup-torch-neuronx-ubuntu22: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22 ===================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. include:: /setup/install-templates/trn1-ga-warning.txt .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Select Ubuntu Server 22 AMI * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 242 :end-line: 243 .. include:: /includes/setup/tab-inference-torch-neuronx-u22.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-u22.rst .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u22.rst ================================================ FILE: _backup-setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu24.rst ================================================ .. _setup-torch-neuronx-ubuntu24: .. card:: Select a Different Framework or Platform for Setup :link: setup-guide-index :link-type: ref :class-body: sphinx-design-class-title-small PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 24 ===================================================== .. contents:: Table of contents :local: :depth: 2 Get Started with Latest Release of PyTorch Neuron (``torch-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides links to help you quickly get started with a fresh installation of :ref:`pytorch-neuronx-main` for both Inference and Training. .. include:: /setup/install-templates/trn1-ga-warning.txt
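The launch steps below stress selecting the correct instance type. Once you have connected to the instance you launch, a quick sanity check is to ask the EC2 instance metadata service (IMDSv2) what you are actually running on:

.. code-block:: bash

   # Query EC2 instance metadata (IMDSv2) for the instance type
   TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
       -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
   curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
       http://169.254.169.254/latest/meta-data/instance-type
   # Expect a trn1/trn1n or inf2 instance type in the output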
.. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type. * To get more information about instance sizes and pricing, see: `Trn1 web page `_, `Inf2 web page `_ * Select Ubuntu Server 24 AMI * When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512 GB. * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 299 :end-line: 300 .. include:: /includes/setup/tab-inference-torch-neuronx-u24.txt .. include:: /frameworks/torch/torch-neuronx/setup/pytorch-update-u24.rst ================================================ FILE: _content-types/conceptual-deep-dive.rst ================================================ .. meta:: :description: {short description here} :date_updated: {planned date of publication here} .. _{RST page ref string here}: ================================================================================ Deep dive: {concept/practice/technique name; use sentence-case, not title case!} ================================================================================ .. {SEO-friendly intro paragraph, no more than 3 sentences total.} This topic explores {subjects} in depth, discussing its technical details from the perspective of an AWS Neuron expert. Some experience in {related subjects here} is required to understand it in full. What you should know before reading ----------------------------------- .. {If there is anything the reader should know before diving into this material, note it here and provide any supporting links. This also helps LLMs training on this content have greater technical context for this subject.} Before you start, you must be familiar with the following: - **Concept 1:** {Brief description. Link to a related topic if necessary.} - **Concept 2:** {Brief description. Link to a related topic if necessary.} Overview --------- .. {Your first section, which should cover the subject from the title at a high level. If appropriate, note when this concept is applicable in Neuron components and developer workflows. Starting off with a diagram can help illustrate the concept.} PARAGRAPH 1 PARAGRAPH 2 .. image:: images/diagram-name.png :alt: {Alt text for diagram} :align: center {Section 1 Title} ----------------- .. {Each section should build on top of what was discussed in the previous sections. If a new concept is introduced that wasn't discussed previously, link to a topic that covers it. You can add subsections within this section if it helps to break it up more and clarify the content, but do not go more than 1-2 levels deep.} PARAGRAPH 1 PARAGRAPH 2 .. code-block:: python # Code example if applicable def example_function(): pass {Section 2 Title} ----------------- .. {Each section should build on top of what was discussed in the previous sections. If a new concept is introduced that wasn't discussed previously, link to a topic that covers it. You can add subsections within this section if it helps to break it up more and clarify the content, but do not go more than 1-2 levels deep.} PARAGRAPH 1 PARAGRAPH 2 .. code-block:: python # Code example if applicable def example_function(): pass .. {Add more sections as appropriate to logically break up the content. Each section should be focused on a specific aspect of the concept.} {optional}Related Concepts -------------------------- * :ref:`link-reference-name` - {description} * :ref:`link-reference-name` - {description} {optional}Further Reading ------------------------- .. toctree:: :maxdepth: 1 * `External Link `_ - {description} * :doc:`/path/to/internal/doc` - {description} .. (Note to both the writer and any AI incorporating this template: The content below is provided as a resource and should not be included as-is in any final document created using this template as a basis.) .. note:: .. Additional implementation details or important considerations can be added as admonitions. .. warning:: .. Critical information or potential pitfalls can be highlighted using warning admonitions. ================================================ FILE: _content-types/model-card.rst ================================================ .. _unique-ref-id-here: .. meta:: :description: AWS Neuron SDK model card for {Model Name}, version {version}. Overview, intended use, training data, performance, limitations, ethical considerations, and citations. :date-modified: 2026-10-03 Model Card: {Model Name} ======================== .. contents:: Table of Contents :depth: 1 :local: Model overview -------------- :Model name: {name} :Version: {version} :Organization: {organization} :License: {license} :Last updated: {date} .. warning:: {Important warnings or critical limitations} Quickstart ---------- .. code-block:: python # Example usage code from model import Model model = Model.from_pretrained("model_name") output = model.generate("Your input text") Model details ------------- Architecture ^^^^^^^^^^^^ - Base architecture: {architecture} - Number of parameters: {parameter_count} - Model dimensions: {model_dimensions} - Training objective: {training_objective} Hardware requirements ^^^^^^^^^^^^^^^^^^^^^ - Minimum RAM: {min_ram} - Recommended GPU: {gpu_specs} - Disk space: {disk_space} Intended Use ------------ Primary uses ^^^^^^^^^^^^ * {use_case_1} * {use_case_2} * {use_case_3} Out-of-Scope uses ^^^^^^^^^^^^^^^^^ * {prohibited_use_1} * {prohibited_use_2} Training data ------------- Datasets ^^^^^^^^ .. list-table:: :header-rows: 1 * - Dataset Name - Size - Description * - {dataset1} - {size1} - {description1} * - {dataset2} - {size2} - {description2} Training procedure ^^^^^^^^^^^^^^^^^^ * Training hardware: {hardware_details} * Training time: {duration} * Training cost: {cost_estimate} * Carbon footprint: {carbon_impact} Performance and limitations --------------------------- Benchmarks ^^^^^^^^^^ .. list-table:: :header-rows: 1 * - Benchmark - Score - Details * - {benchmark1} - {score1} - {details1} * - {benchmark2} - {score2} - {details2} Known limitations ^^^^^^^^^^^^^^^^^ * {limitation_1} * {limitation_2} Bias and fairness ^^^^^^^^^^^^^^^^^ * {bias_consideration_1} * {bias_consideration_2} Ethical considerations ---------------------- Potential risks ^^^^^^^^^^^^^^^ * {risk_1} * {risk_2} Mitigation strategies ^^^^^^^^^^^^^^^^^^^^^ * {strategy_1} * {strategy_2} Model details and notes ----------------------- {Provide detailed information about the model, its training, evaluation, and any other relevant aspects.
Create the sections as needed.} {Section 1 title} ^^^^^^^^^^^^^^^^^ {Details for section 1.} {Section 2 title} ^^^^^^^^^^^^^^^^^ {Details for section 2.} {. . .} Citations --------- .. code-block:: bibtex @article{model_paper, title={}, author={}, journal={}, year={} } Version history --------------- .. list-table:: :header-rows: 1 * - Version - Date - Changes * - {version1} - {date1} - {changes1} * - {version2} - {date2} - {changes2} Contact ------- :Documentation Issues: {link_to_issues} :Support: {support_contact} :Website: {website_url} ================================================ FILE: _content-types/procedural-how-to.rst ================================================ .. meta:: :description: {short description here} :date_updated: {planned date of publication here} .. _{RST page ref string here}: ======================================================================== How to {verb phrase with specific features or models that will be used} ======================================================================== Task overview ------------- .. {SEO-friendly intro paragraph, no more than 3 sentences total.} This topic discusses how to {description of task or process here} using the AWS Neuron SDK. {Short description of what the task will accomplish.} Prerequisites ------------- - **Prerequisite 1:** Description. Link to a related topic if necessary. - **Prerequisite 2:** Description. Link to a related topic if necessary. Instructions ------------ **1:** {First step; start with verb/action} .. {Describe what the user will do in this step, starting with a verb. If applicable, include any commands or code examples that illustrate the step.} .. code-block:: bash # Command or code example command --flag value .. {Additional detail if needed.} .. note:: .. {Optional; important information or caveats about this step} **2:** {Second step; start with verb/action} .. .. {Describe what the user will do in this step, starting with a verb. If applicable, include any commands or code examples that illustrate the step.} .. code-block:: python # Code example if applicable def example(): pass .. {Additional detail if needed.} .. note:: .. {Optional; important information or caveats about this step} .. **{More discrete steps as needed, following the same pattern as above.}** **N:** {Last step; start with verb/action} .. {Final step instructions} Confirm your work ----------------- To confirm you have successfully completed this task, {how to verify the task was done correctly}: .. {Provide them with a way to know they’ve done everything correctly. This could be a screenshot, command-line output, a tool to launch, or specific settings to check.} .. code-block:: bash # Verification command if applicable verify-command --check Common issues ------------- Uh oh! Did you encounter an error or other issue while working through this task? Here are some commonly encountered issues and how to address them. .. rubric:: {Problem 1} - **Possible solution**: {detailed solution} .. rubric:: {Problem 2} - **Possible solution**: {detailed solution} Related information ------------------- .. toctree:: :maxdepth: 1 * `External Link `_ - {description} * :doc:`/path/to/internal/doc` - {description} ================================================ FILE: _content-types/procedural-tutorial.ipynb ================================================ { "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "vscode": { "languageId": "raw" } }, "source": [ ".. 
meta::\n", " :description: {SEO-friendly short description of the tutorial. Include 'Neuron' and any keywords such as the language mode and framework.}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: {title starting with verb}\n", "\n", "This tutorial guides you through using the AWS Neuron SDK to {description of what the reader will accomplish in this tutorial, using a specific component or framework}.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "{Briefly summarize the purpose and outcome of this end-to-end tutorial}.\n", "{State what users will learn or achieve by completing the tutorial}." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "vscode": { "languageId": "raw" } }, "source": [ ".. contents:: Table of contents\n", " :local:\n", " :depth: 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Before you start\n", "\n", "To successfully complete this tutorial, you must have completed the following steps in advance:\n\n", "- Downloaded and installed the [AWS Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/index.html) for {component}.\n", "- {prerequisite 2 description here. If the user must read a topic in advance or perform any complex preparations, provide a link to a topic or download}\n", "- {prerequisite 3 description here}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "{Describe any initial local setup required before starting the tutorial.}\n", "{Include any code-specific installation, configuration, or environment setup steps.}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example setup command (Remove these comments and add the CLI commands, env variable declarations, or other operations for the user to prepare their environment.)\n", "# pip install package_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tutorial steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: {Title, starting with an infinitive verb like 'Load...', 'Create...', etc.}\n", "\n", "{Describe the first main step. Provide code, commands, or configuration as needed.}\n", "\n", "{Optional} {Add any important notes, caveats, or warnings for this step.}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Code goes here!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: {Title, starting with an infinitive verb like 'Load...', 'Create...', etc.}\n", "\n", "{Describe the second main step.}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Code goes here!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: {Title, starting with an infinitive verb like 'Load...', 'Create...', etc.}\n", "\n", "Describe the third main step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Code goes here!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step N: {Title, starting with an infinitive verb like 'Load...', 'Create...', etc.}\n", "\n", "Describe the last main step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Code goes here!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nCode completed. Now, let's run it..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the code\n", "\n", "To run this code, {action to take to run the code}:\n", "Include commands, expected outputs, or checks to perform." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example verification command\n", "# python foo.py\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "If your code works, you will see output like this:\n\n", "```\n", "Loading glorp inhume logic...Done!\n", "Configuring extubation channel instances...Done!\n\n", "1111 | 2222 | 3333\n", "4444 | 5555 | 6666\n\n", "Average glorps inhumed and extubated: 420\n", "Time to max glorp: 8 seconds\n", "```\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nCongratulations! You now know how to {goal of tutorial}. If your code did not run or did not produce similar results, see the [Common issues](#Common-issues) section below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common issues\n", "\n", "Here are some common errors and mistakes you can make when developing code using the approach in this tutorial, and how you may be able to address them:\n\n", "- {describe error, symptoms, and possible solution}\n", "- {describe error, symptoms, and possible solution}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (Optional) Next steps\n", "\n", "{Suggest what users might want to do next after completing the tutorial.\n", "Link to related topics or advanced guides.}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Related topics\n", "\n", "- [Related topic 1](link_here)\n", "- [Related topic 2](link_here)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: _content-types/reference-kernel-api.rst ================================================ .. meta:: :description: API reference for the {kernel-name} kernel included in the NKI Library. :date-modified: MM/DD/YYYY .. currentmodule:: {kernel namespace}.{kernel module path} RMSNorm-Quant Kernel API Reference ================================== This topic provides the API reference for the ``{kernel name}`` kernel. The kernel performs optional RMS normalization followed by quantization to ``fp8``. The kernel supports: * {feature 1} * {feature 2} * {feature 3} * ... {more features as needed} Background ----------- The ``{kernel}`` kernel ... {description of kernel functionality based on sources} For detailed information about the mathematical operations and implementation details, refer to the :doc:`{kernel name} Kernel Design Specification `.
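For writers filling in this template: the Background section usually benefits from stating the governing math. As a generic reference formulation of RMS normalization followed by ``fp8`` quantization (an illustration, not the contract of any specific kernel):

.. math::

   \mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}\, w_i,
   \qquad
   q_i = \mathrm{cast}_{\mathrm{fp8}}\left(\frac{\mathrm{RMSNorm}(x)_i}{s}\right)

where :math:`d` is the hidden dimension, :math:`w` is the learned scale vector (``ln_w`` in the signature below), :math:`\epsilon` is a small constant for numerical stability, and :math:`s` is a quantization scale chosen so that results fit the ``fp8`` representable range.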
API Reference -------------- {kernel argument class name} ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: {kernel argument class name} {kernel name} Kernel arguments. .. py:attribute:: {attribute-1} :type: {attribute-1-type} {description from docstring} .. py:attribute:: {attribute-2} :type: {attribute-2-type} {description from docstring} {more attributes as needed} .. py:method:: {method syntax} -> {return type} {description from docstring} .. py:method:: {method syntax} -> {return type} {description from docstring} **Raises**: * **{exception-1}** – {when exception is raised} * **{exception-2}** – {when exception is raised} {kernel API function name in code} ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:function:: rmsnorm_quant_kernel(hidden: nt.tensor, ln_w: nt.tensor, kargs: RmsNormQuantKernelArgs) {definition of method used to instantiate or invoke kernel here, from source docstrings} {params and types with descriptions from source docstrings} Implementation Details ----------------------- The kernel implementation includes several key optimizations: 1. **{optimization-or-feature}**: {description} 2. **{optimization-or-feature}**: {description} 3. **{optimization-or-feature}**: {description} Example -------- The following is a simple example of how to use the ``{kernel}`` kernel: .. code-block:: python # Code here -- need usage example in pedagogical style. See Also -------- * :doc:`{kernel} ` ================================================ FILE: _content-types/release-notes-templates/compiler.rst ================================================ .. _neuron-2-XX-0-compiler: .. meta:: :description: The official release notes for the AWS Neuron SDK compiler component, version X.XX.0. Release date: XX/XX/2026. AWS Neuron SDK 2.XX.X: Neuron Compiler release notes ==================================================== **Date of release**: Month Day, 2026 .. contents:: In this release :local: :depth: 1 * Go back to the :ref:`AWS Neuron 2.XX.0 release notes home ` Improvements ------------ *Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!* Feature 1 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 2 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 3 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Behavioral changes ------------------ *Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.* * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * . . . Breaking changes ---------------- *Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.* * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * . . . Bug fixes --------- Here's what we fixed in 2.XX.X: * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * . . . Known issues ------------ *Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!* * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * . . . ================================================ FILE: _content-types/release-notes-templates/containers.rst ================================================ .. _neuron-2-XX-0-dlc: .. meta:: :description: The official release notes for the AWS Neuron SDK Deep Learning Containers (DLC) component, version X.XX.0. Release date: XX/XX/2026. AWS Neuron SDK 2.XX.0: Neuron Deep Learning Containers release notes ==================================================================== **Date of release**: Month Day, 2026 .. contents:: In this release :local: :depth: 1 * Go back to the :ref:`AWS Neuron 2.XX.0 release notes home ` Improvements ------------ *Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!* Feature 1 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 2 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 3 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Behavioral changes ------------------ *Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.* * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * . . . Breaking changes ---------------- *Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.* * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * . . . Bug fixes --------- Here's what we fixed in 2.XX.X: * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * . . . Known issues ------------ *Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!* * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * . . . ================================================ FILE: _content-types/release-notes-templates/dlami.rst ================================================ .. _neuron-2-XX-0-dlami: .. meta:: :description: The official release notes for the AWS Neuron SDK Deep Learning AWS Machine Images (DLAMIs) component, version X.XX.0. Release date: XX/XX/2026. AWS Neuron SDK 2.XX.X: Neuron Deep Learning AWS Machine Images release notes ============================================================================ **Date of release**: Month Day, 2026 .. contents:: In this release :local: :depth: 1 * Go back to the :ref:`AWS Neuron 2.XX.X release notes home ` Improvements ------------ *Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!* Feature 1 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 2 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Feature 3 ^^^^^^^^^ USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE Behavioral changes ------------------ *Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.* * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE. * . . . Breaking changes ---------------- *Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.* * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE. * . . . Bug fixes --------- Here's what we fixed in 2.XX.X: * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * SHORT SENTENCE DESCRIBING BUG FIX. * . . . Known issues ------------ *Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!* * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT. * . . . ================================================ FILE: _content-types/release-notes-templates/index.rst ================================================ .. _neuron-2-XX-0-whatsnew: .. _latest-neuron-release: .. meta:: :description: The official release notes for the AWS Neuron SDK, version X.XX.0. Release date: XX/XX/2026. AWS Neuron SDK 2.XX.X release notes =================================== **Date of release**: Month Day, 2026 .. toctree:: :hidden: :maxdepth: 1 PyTorch support JAX support NxD Inference NxD Training NxD Core Neuron Compiler NKI Neuron Runtime Developer tools Deep Learning AMIs Deep Learning Containers Release artifacts <../releasecontent> What's new? ----------- AWS and Annapurna Labs are excited to bring you release version 2.XX.X of the Neuron SDK! In this release you'll find improvements to... * . . . * . . . * . . . .. contents:: In this release :local: :depth: 1 Release highlights ------------------ Version 2.XX.X brings some exciting new features! HYPE TEXT HERE HIGHLIGHT 1 ^^^^^^^^^^^ HYPE TEXT HERE * TALKING POINT 1 * TALKING POINT 2 * . . .
USE CASE DESCRIPTION HERE

For more details, see :doc:`DOC LINK `

HIGHLIGHT 3
^^^^^^^^^^^

HYPE TEXT HERE

* TALKING POINT 1
* TALKING POINT 2
* . . .

USE CASE DESCRIPTION HERE

For more details, see :doc:`DOC LINK `

Other important changes
^^^^^^^^^^^^^^^^^^^^^^^

This release also includes the following improvements:

* . . . LINK TO COMPONENT RELEASE NOTE PAGE
* . . . LINK TO COMPONENT RELEASE NOTE PAGE
* . . . LINK TO COMPONENT RELEASE NOTE PAGE
* . . . LINK TO COMPONENT RELEASE NOTE PAGE

Component release notes
-----------------------

Select a card below to review detailed release notes for each component of the Neuron SDK version 2.XX.0. These component release notes contain details on specific new and improved features, as well as breaking changes, bug fixes, and known issues for that component area of the Neuron SDK.

.. grid:: 1 1 2 2
   :gutter: 2

   .. grid-item-card::
      :link: neuron-2-XX-0-pytorch
      :link-type: ref

      **PyTorch support** 2.XX.0 release notes
      ^^^
      Neuron features and solutions that support the PyTorch ML framework.
      +++
      Supports: ``Inf2``, ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-jax
      :link-type: ref

      **JAX support** 2.XX.0 release notes
      ^^^
      Neuron features and solutions that support the JAX ML framework.
      +++
      Supports: ``Inf2``, ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-nxd-training
      :link-type: ref

      **NxD Training** 2.XX.0 release notes
      ^^^
      Neuron features and tools for LLM and agent ML model training.
      +++
      Supports: ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-nxd-inference
      :link-type: ref

      **NxD Inference** 2.XX.0 release notes
      ^^^
      Neuron features and tools for LLM and agent ML model inference.
      +++
      Supports: ``Inf2``, ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-nxd-core
      :link-type: ref

      **NxD Core** 2.XX.0 release notes
      ^^^
      Common features and tools for Neuron-based training and inference.
      +++
      Supports: ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-compiler
      :link-type: ref

      **Neuron Compiler** 2.XX.0 release notes
      ^^^
      The Neuron compiler for AWS Trainium and Inferentia, and its libraries and tools.
      +++
      Supports: ``Inf2``, ``Trn1`` / ``Trn1n``, ``Trn2``

   .. grid-item-card::
      :link: neuron-2-XX-0-nki
      :link-type: ref

      **Neuron Kernel Interface (NKI)** 2.XX.0 release notes
      ^^^
      Neuron's Python-based programming interface for developing and optimizing Neuron kernels.
      +++
      Supports: ``Inf2``, ``Trn1``, ``Trn1n``

   .. grid-item-card::
      :link: neuron-2-XX-0-runtime
      :link-type: ref

      **Neuron Runtime** 2.XX.0 release notes
      ^^^
      The Neuron kernel driver and C++ libraries for AWS Inferentia and Trainium instances.
      +++
      Supports: ``Inf1``, ``Inf2``, ``Trn1`` / ``Trn1n``

   .. grid-item-card::
      :link: neuron-2-XX-0-tools
      :link-type: ref

      **Neuron Developer Tools** 2.XX.0 release notes
      ^^^
      Tools that support end-to-end development for AWS Neuron.
      +++
      Supports: ``Inf1``, ``Inf2``, ``Trn1`` / ``Trn1n``

   .. grid-item-card::
      :link: neuron-2-XX-0-dlami
      :link-type: ref

      **Neuron Deep Learning AWS Machine Images (DLAMIs)** 2.XX.0 release notes
      ^^^
      AWS-specific machine images for building and deploying Neuron-based ML solutions.
      +++
      Supports: ``Inf1``, ``Inf2``, ``Trn1`` / ``Trn1n``

   .. grid-item-card::
      :link: neuron-2-XX-0-dlc
      :link-type: ref

      **Neuron Deep Learning Containers (DLCs)** 2.XX.0 release notes
      ^^^
      AWS-specific container definitions for building and deploying Neuron-based ML solutions.
      +++
      Supports: ``Inf1``, ``Inf2``, ``Trn1`` / ``Trn1n``

   .. grid-item-card::
      :link: latest-neuron-release-artifacts
      :link-type: ref

      **Neuron 2.XX.0 release artifacts**
      ^^^
      The libraries and packages updated in this release.

Support announcements
---------------------

This section covers official end-of-support announcements, as well as the features, tools, and APIs that reach end of support in this release.

End-of-support announcements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*An "end-of-support (EoS)" announcement is a notification that a feature, tool, or API will not be supported in the future. Plan accordingly!*

* END-OF-SUPPORT ANNOUNCEMENT 1 (link to announcement here)
* . . .

Ending support in 2.XX.0
^^^^^^^^^^^^^^^^^^^^^^^^

"End of support" means that AWS Neuron no longer supports the feature, tool, or API indicated in the note as of this release.

* ENDING SUPPORT ANNOUNCEMENT 1 (link to announcement here)
* . . .

Previous releases
-----------------

* :doc:`Neuron 2.27.0 `
* :doc:`Neuron 2.26.0 `
* :doc:`Neuron 2.25.0 `
* :doc:`Earlier releases `
* :ref:`prev-rn`
* :ref:`pre-release-content`
* :ref:`prev-n1-rn`

================================================
FILE: _content-types/release-notes-templates/nki.rst
================================================
.. _neuron-2-XX-0-nki:

.. meta::
   :description: The official release notes for the AWS Neuron Kernel Interface (NKI) component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: Neuron Kernel Interface (NKI) release notes
==================================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/nx-jax.rst
================================================
.. _neuron-2-XX-0-jax:

.. meta::
   :description: The official release notes for the AWS Neuron SDK JAX support component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: JAX support release notes
================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Released versions
-----------------

* ``0.6.1.1.0.*``

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/nx-pytorch.rst
================================================
.. _neuron-2-XX-0-pytorch:

.. meta::
   :description: The official release notes for AWS Neuron SDK PyTorch support, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: PyTorch support release notes
====================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Released versions
-----------------

* ...
Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/nxd-core.rst
================================================
.. _neuron-2-XX-0-nxd-core:

.. meta::
   :description: The official release notes for the AWS Neuron SDK NxD Core component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: NxD Core release notes
=============================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/nxd-inference.rst
================================================
.. _neuron-2-XX-0-nxd-inference:

.. meta::
   :description: The official release notes for the AWS Neuron SDK NxD Inference component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: NxD Inference release notes
==================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .
================================================
FILE: _content-types/release-notes-templates/nxd-training.rst
================================================
.. _neuron-2-XX-0-nxd-training:

.. meta::
   :description: The official release notes for the AWS Neuron SDK NxD Training component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: NxD Training release notes
=================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/runtime.rst
================================================
.. _neuron-2-XX-0-runtime:

.. meta::
   :description: The official release notes for the AWS Neuron SDK Runtime component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: Neuron Runtime release notes
===================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .

Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _content-types/release-notes-templates/tools.rst
================================================
.. _neuron-2-XX-0-tools:

.. meta::
   :description: The official release notes for the AWS Neuron SDK Developer Tools component, version 2.XX.0. Release date: XX/XX/2026.

AWS Neuron SDK 2.XX.0: Developer Tools release notes
====================================================

**Date of release**: Month Day, 2026

.. contents:: In this release
   :local:
   :depth: 1

* Go back to the :ref:`AWS Neuron 2.XX.0 release notes home <neuron-2-XX-0-whatsnew>`

Improvements
------------

*Improvements are significant new or improved features and solutions introduced in this release of the AWS Neuron SDK. Read on to learn about them!*

Feature 1
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 2
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Feature 3
^^^^^^^^^

USER-FACING DESCRIPTION OF IMPROVEMENT (WHAT WILL IT DO FOR DEV CUSTOMERS), WHY WE MADE THE IMPROVEMENT, LINK TO SUPPORTING DOC PAGE

Behavioral changes
------------------

*Behavioral changes are small, user-facing changes that you may notice after upgrading to this version.*

* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* CHANGE DESCRIPTION SENTENCE. NOTE HOW THE USER MAY EXPERIENCE IT, IF APPLICABLE.
* . . .
Breaking changes
----------------

*Sometimes we have to break something now to make the experience better in the longer term. Breaking changes are changes that may require you to update your own code, tools, and configurations.*

* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* CHANGE DESCRIPTION SENTENCE. NOTE WHEN THE USER MAY ENCOUNTER IT. PROVIDE A WORKAROUND, IF POSSIBLE.
* . . .

Bug fixes
---------

Here's what we fixed in 2.XX.0:

* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* SHORT SENTENCE DESCRIBING BUG FIX.
* . . .

Known issues
------------

*Something doesn't work. Check here to find out if we already knew about it. We hope to fix these soon!*

* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* SENTENCE DESCRIBING ISSUE AND WHEN THE USER WILL ENCOUNTER IT.
* . . .

================================================
FILE: _ext/archive.py
================================================
# This file creates a downloadable archive from each directory listed in src_dirs.
# You can modify or add additional archive_handler functions here to create additional archives.
import os, tarfile


def archive_handler(app):
    old_cwd = os.getcwd()
    src_dirs = ['src/examples/pytorch', 'src']
    target_dirs = ['libtorch_demo', 'neuronperf']
    archive_names = [name + '.tar.gz' for name in target_dirs]
    for src_dir, target_dir, archive_name in zip(src_dirs, target_dirs, archive_names):
        os.chdir(src_dir)
        # Remove any stale archive before rebuilding it.
        try:
            os.remove(archive_name)
        except OSError:
            pass
        with tarfile.open(archive_name, 'w:gz') as tar:
            tar.add(target_dir)
        os.chdir(old_cwd)


def setup(app):
    app.connect('builder-inited', archive_handler)
    return {
        'version': '1.0',
        'parallel_read_safe': True,
        'parallel_write_safe': True,
    }

================================================
FILE: _ext/df_tables.py
================================================
import os
from docutils.parsers.rst import Directive, directives
from docutils.parsers.rst.directives.tables import CSVTable


class DFTable(CSVTable):
    # The :df-var: option names the variable that holds the DataFrame to
    # render; it defaults to "df". Copy the base option_spec rather than
    # mutating CSVTable's shared dict.
    option_spec = dict(CSVTable.option_spec)
    option_spec['df-var'] = directives.unchanged

    df = None

    def __init__(self, name, arguments, options, content, lineno,
                 content_offset, block_text, state, state_machine):
        super().__init__(name, arguments, options, content, lineno,
                         content_offset, block_text, state, state_machine)

    def get_csv_data(self):
        # Feed the DataFrame to CSVTable as CSV lines.
        return self.df.to_csv(index=False).splitlines(), None

    def run(self):
        source_file_name = self.state_machine.document.attributes["source"]
        dirname = os.path.abspath(os.path.dirname(source_file_name))
        os.chdir(dirname)
        code = "\n".join(map(str, self.content))
        ns = {}
        try:
            # Pre-import numpy and pandas into the namespace the directive
            # body executes in.
            exec("\n".join(["import numpy as np",
                            "import pandas as pd",
                            ]), ns)
            variable_name = "df"
            if self.options.get("df-var"):
                variable_name = self.options.get("df-var")
            exec(code, ns)
            self.df = ns[variable_name]
        except Exception as e:
            raise self.error(str(e))
        return super().run()


def setup(app):
    setup.app = app
    setup.config = app.config
    setup.confdir = app.confdir
    app.add_directive("df-table", DFTable)
    metadata = {
        "parallel_read_safe": True,
        "parallel_write_safe": True,
        "version": 0.1,
    }
    return metadata
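The directive's contract is simply that, after its body executes, a pandas DataFrame is bound to ``df`` (or to the name given via the ``:df-var:`` option), and the resulting CSV is handed to docutils' ``CSVTable``. A minimal stand-alone sketch of that execution model follows; the directive body string and its column names are hypothetical example data, not taken from the repo:

```python
# Stand-alone sketch of what df-table does with its body (assumes pandas is
# installed). The content string below is a hypothetical directive body.
import pandas as pd

content = "df = pd.DataFrame({'Instance': ['Inf2', 'Trn1'], 'Cores': [2, 2]})"
ns = {"pd": pd}
exec(content, ns)                    # the directive exec()s its body the same way
print(ns["df"].to_csv(index=False))  # this CSV is what CSVTable renders
```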
================================================
FILE: _ext/local_documenter.py
================================================
import os
import sys
from sphinx.ext.autodoc import ModuleDocumenter, FunctionDocumenter


class LocalModuleDocumenter(ModuleDocumenter):
    """
    Provides identical functionality to "automodule", but allows the module
    function names to be overridden with the "module-name" option. This also
    allows local python files to be documented as if they were imported from
    an actual package by temporarily adding the directory of the RST file to
    the python path.
    """
    option_spec = dict(ModuleDocumenter.option_spec)
    option_spec['module-name'] = lambda x=None: x

    def import_object(self, *args):
        """Find modules local to the RST document directory"""
        local = os.path.join(self.env.app.srcdir, os.path.dirname(self.env.docname))
        sys.path.append(local)
        result = super().import_object(*args)
        sys.path.remove(local)
        return result

    def get_module_members(self):
        """Add module name override to local files"""
        members = super().get_module_members()
        name = self.options.module_name
        if name is not None:
            for member in members.values():
                if callable(member.object):
                    setattr(member.object, 'module_name_override', name)
        return members


class LocalFunctionDocumenter(FunctionDocumenter):

    def format_name(self) -> str:
        """Apply module name override to local functions"""
        # Use overridden module path if it is provided
        if hasattr(self.object, 'module_name_override'):
            self.objpath = self.object.module_name_override.split('.') + [self.objpath[-1]]
        return super().format_name()


def setup(app):
    app.add_autodocumenter(LocalFunctionDocumenter)
    app.add_autodocumenter(LocalModuleDocumenter)
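The override itself is plain attribute plumbing: the module documenter stamps each callable member with a ``module_name_override`` attribute, and the function documenter rebuilds the dotted path from it. A minimal sketch of that hand-off outside of Sphinx; the function name and dotted path below are illustrative, not from the repo:

```python
# Minimal sketch of the module_name_override hand-off, independent of Sphinx.
def my_kernel(x):
    return x

# What LocalModuleDocumenter.get_module_members() does per callable member:
setattr(my_kernel, 'module_name_override', 'mypackage.kernels')

# What LocalFunctionDocumenter.format_name() then reconstructs:
objpath = ['my_kernel']
objpath = my_kernel.module_name_override.split('.') + [objpath[-1]]
print('.'.join(objpath))  # -> mypackage.kernels.my_kernel
```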
================================================
FILE: _ext/neuron_tag.py
================================================
import os
from docutils import nodes
from docutils.statemachine import ViewList
from sphinx.util.docutils import SphinxDirective
from sphinx.util.nodes import nested_parse_with_titles

# =============================================================================
# Legacy add/clear lists (used only for files NOT handled by explicit overrides)
# =============================================================================
# These lists use substring matching via in_list(). They apply ONLY when no
# explicit_override was set. As more paths get explicit overrides, entries
# here become dead code. Kept for backward compatibility with paths not yet
# explicitly overridden.

add_inf1_tag = [
    'about-neuron/arch', 'archive/mxnet-neuron',
    'about-neuron/announcements/index', 'archive/tensorflow/tensorflow-neuron/',
]

add_trn1_tag = [
    'frameworks/neuron-customops/', 'neuron-customops/',
    'frameworks/torch/inference-torch-neuronx',
    'libraries/nemo-megatron/', 'libraries/nxd-training/',
]

add_trn2_tag = [
    'libraries/nxd-training/', 'about-neuron/models/',
]

add_trn3_tag = [
    'about-neuron/arch/neuron-hardware/neuron-core-v4',
    'about-neuron/arch/neuron-hardware/trn3-arch',
]

add_neuronx_tag = [
    'frameworks/torch/torch-neuronx/', 'archive/tensorflow/tensorflow-neuronx/',
    'frameworks/torch/inference-torch-neuronx/', 'libraries/neuronx-distributed/',
    'libraries/nxd-training', 'setup/tensorflow-neuronx',
]

clear_inf1_tag = [
    'about-neuron/arch/neuron-features/neuron-caching',
    'about-neuron/arch/neuron-features/eager-debug-mode',
    'about-neuron/arch/neuron-features/collective-communication-operations',
    'about-neuron/arch/neuron-features/dynamic-shapes',
    'about-neuron/arch/neuron-features/control-flow',
    'about-neuron/arch/neuron-features/custom-c++-operators',
    'about-neuron/arch/neuron-features/collective-communication',
    'about-neuron/arch/neuron-features/rounding-modes',
    'about-neuron/arch/neuron-hardware/trn1-arch',
    'about-neuron/arch/neuron-hardware/inf2-arch',
    'about-neuron/arch/neuron-hardware/inferentia2',
    'about-neuron/arch/neuron-hardware/trainium',
    'about-neuron/arch/neuron-hardware/neuron-core-v2',
    'about-neuron/arch/neuron-hardware/trn2-arch',
    'about-neuron/arch/neuron-hardware/trn3-arch',
    'about-neuron/arch/neuron-hardware/neuron-core-v3',
    'about-neuron/arch/neuron-hardware/neuron-core-v4',
    'about-neuron/benchmarks/trn1-performance',
    'about-neuron/benchmarks/trn1/',
    'about-neuron/benchmarks/inf2/inf2-performance',
    'about-neuron/faq/training/',
    'about-neuron/models/inference-inf2-trn1-samples',
    'about-neuron/models/training-trn1-samples',
    'about-neuron/models/training-inference-trn2-samples',
    'about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision',
    'about-neuron/appnotes/transformers-neuronx/generative-llm-inference-with-neuron',
    'about-neuron/appnotes/torch-neuronx/torch-neuronx-dataparallel-app-note',
    'about-neuron/calculator/neuron-calculator',
    'about-neuron/announcements/neuron2.x/dlami-pytorch-introduce',
    'about-neuron/announcements/neuron2.x/sm-training-trn1-introduce',
    'about-neuron/announcements/neuron2.x/sm-training-dlc-2.9.1',
    'devflows/training',
    'devflows/inference/byoc-hosting-devflow-inf2',
    'compiler/neuronx-cc/',
    'about-neuron/appnotes/perf/neuronx-cc/',
    'frameworks/torch/torch-neuronx/',
    'frameworks/torch/training',
    'frameworks/torch/inference-torch-neuronx',
    'archive/tensorflow/tensorflow-neuronx/',
    'archive/tensorflow/tensorflow-neuronx-inference',
    'frameworks/torch/torch-neuronx/transformers-neuronx/readme',
    'release-notes/neuron-cc/index',
    'release-notes/runtime/aws-neuronx-collectives/',
    'release-notes/torch/torch-neuronx/',
    'release-notes/torch/transformers-neuronx/index',
    'release-notes/tensorflow/tensorflow-neuronx/',
    'release-notes/compiler/neuronx-cc/',
    'tools/tutorials/tutorial-tensorboard-scalars-mnist',
    'tools/tutorials/tutorial-neuron-monitor-mnist',
    'tools/tensorboard/getting-started-tensorboard-neuronx-plugin',
    'tools/neuron-sys-tools/nccom-test',
    'setup/torch-neuronx',
    'setup/tensorflow-neuronx',
    'setup/neuron-setup/tensorflow/neuronx/',
    'setup/neuron-setup/pytorch/neuronx/',
    'nki/',
    'frameworks/jax/',
    'libraries/nxd-training/',
    '/release-notes/components/nki',
    '/release-notes/components/nki-lib',
    '/release-notes/components/compiler',
]

clear_inf2_tag = [
    'frameworks/torch/torch-neuronx/training',
    'frameworks/torch/training',
    'archive/torch-neuron/inference-torch-neuron',
    'archive/tensorflow/tensorflow-neuron-inference',
    'frameworks/jax/',
    'about-neuron/arch/neuron-hardware/trn1-arch',
    'about-neuron/arch/neuron-hardware/trainium',
    'about-neuron/arch/neuron-hardware/trn2-arch',
    'about-neuron/arch/neuron-hardware/trn3-arch',
    'about-neuron/arch/neuron-hardware/neuron-core-v3',
    'about-neuron/arch/neuron-hardware/neuron-core-v4',
    'about-neuron/arch/neuron-features/logical-neuroncore-config',
    'about-neuron/benchmarks/trn1/trn1-inference-performance',
    'about-neuron/benchmarks/trn1/trn1-training-performance',
    'about-neuron/models/training-trn1-samples',
    'about-neuron/models/training-inference-trn2-samples',
    'about-neuron/announcements/neuron2.x/announce-neuron-trn2',
    'neuronx-distributed/nxd-training',
    'libraries/nxd-training/',
    'tools/neuron-sys-tools/nccom-test',
    'release-notes/runtime/aws-neuronx-collectives/',
]

clear_trn1_tag = [
    'about-neuron/arch/neuron-hardware/inf2-arch',
    'about-neuron/arch/neuron-hardware/inferentia2',
    'about-neuron/arch/neuron-hardware/trn2-arch',
    'about-neuron/arch/neuron-hardware/trn3-arch',
    'about-neuron/arch/neuron-hardware/trainium2',
    'about-neuron/arch/neuron-hardware/neuron-core-v3',
    'about-neuron/arch/neuron-hardware/neuron-core-v4',
    'about-neuron/benchmarks/inf2/inf2-performance',
    'about-neuron/models/training-inference-trn2-samples',
]

clear_trn2_tag = [
    'archive/tensorflow/',
    'libraries/transformers-neuronx/',
    'about-neuron/arch/neuron-hardware/trn1-arch',
    'about-neuron/arch/neuron-hardware/trainium',
    'about-neuron/arch/neuron-hardware/neuron-core-v2',
    'about-neuron/arch/neuron-hardware/neuron-core-v4',
    'about-neuron/arch/neuron-hardware/trn3-arch',
    'about-neuron/benchmarks/',
    'about-neuron/benchmarks/trn1/',
    'about-neuron/benchmarks/inf2/inf2-performance',
    'about-neuron/models/inference-inf2-trn1-samples',
    'about-neuron/models/training-trn1-samples',
    'neuron-customops/programming-guide/custom-c++-operators-devguide',
]

clear_trn3_tag = [
    'archive/tensorflow/',
    'libraries/transformers-neuronx/',
    'about-neuron/arch/neuron-hardware/trn1-arch',
    'about-neuron/arch/neuron-hardware/trainium',
    'about-neuron/arch/neuron-hardware/neuron-core-v2',
    'about-neuron/arch/neuron-hardware/neuron-core-v3',
    'about-neuron/benchmarks/',
    'about-neuron/benchmarks/trn1/',
    'about-neuron/benchmarks/inf2/inf2-performance',
    'about-neuron/models/inference-inf2-trn1-samples',
    'about-neuron/models/training-trn1-samples',
    'libraries/neuronx-distributed/context_parallelism_overview',
    'about-neuron/appnotes/',
    'neuron-customops/programming-guide/custom-c++-operators-devguide',
]

# Neuron 1.x / NeuronCore v1 era content: clear all non-Inf1 tags
clear_nc_v2_tag = [
    'tools/tutorials/tutorial-neuron-check-model',
    'tools/tutorials/tutorial-neuron-gatherinfo',
    'tools/tutorials/getting-started-tensorboard-neuron-plugin',
    'tools/tensorboard/getting-started-tensorboard-neuron-plugin',
    'tools/helper-tools/tutorial-neuron-check-model',
    'tools/helper-tools/tutorial-neuron-gatherinfo',
    'about-neuron/appnotes/neuron-cc/mixed-precision',
    'about-neuron/appnotes/perf/neuron-cc/',
    'about-neuron/appnotes/neuron1x/',
    'about-neuron/appnotes/torch-neuron/',
    'about-neuron/arch/neuron-hardware/inf1-arch',
    'about-neuron/arch/neuron-hardware/inferentia',
    'about-neuron/arch/neuron-hardware/neuron-core-v1',
    'about-neuron/arch/neuron-features/neuroncore-pipeline',
    'about-neuron/announcements/neuron1.x/',
    'about-neuron/quick-start/mxnet-neuron',
    'about-neuron/benchmarks/inf1/',
    'about-neuron/faq/inference/',
    'about-neuron/models/inference-inf1-samples',
    'containers/dlc-then-ec2-devflow',
    'containers/dlc-then-ecs-devflow',
    'containers/dlc-then-eks-devflow',
    'containers/container-sm-hosting-devflow',
    'containers/rn',
    'containers/tutorials/k8s-neuron-scheduler',
    'compiler/neuron-cc/',
    'release-notes/mxnet-neuron/',
    'release-notes/torch/torch-neuron/',
    'release-notes/tensorflow/tensorflow-neuron/',
    'release-notes/compiler/neuron-cc/',
    'release-notes/neuron1/',
    'archive/torch-neuron/',
    'archive/torch-neuron/inference-torch-neuron',
    'archive/tensorflow/tensorflow-neuron/',
    'archive/tensorflow/tensorflow-neuron-inference',
    'archive/mxnet-neuron/',
    'setup/tensorflow-neuron',
    'setup/torch-neuron',
    'setup/mxnet-neuron',
    'setup/neuron-setup/pytorch/neuron/',
    'setup/neuron-setup/mxnet/neuron/ubuntu/',
    'setup/neuron-setup/mxnet/neuron/amazon-linux/',
    'setup/neuron-setup/tensorflow/neuron/ubuntu/',
    'setup/neuron-setup/tensorflow/neuron/amazon-linux/',
]

# Top-level directories used for initial tag assignment
NEURON1_DIRS = ['n1']
COMMON_DIRS = [
    'tools', 'neuron-runtime', 'release-notes', 'containers', 'compiler',
    'frameworks', 'src', 'about-neuron', 'setup', 'devflows', 'dlami',
    'libraries',
]

TEXT_TEMPLATE = '**This document is relevant for**: '

# =============================================================================
# Hardware architecture page map (exact docname -> instance list)
# =============================================================================
HW_ARCH_MAP = {
    'about-neuron/arch/neuron-hardware/inf1-arch': ['Inf1'],
    'about-neuron/arch/neuron-hardware/inf2-arch': ['Inf2'],
    'about-neuron/arch/neuron-hardware/inferentia': ['Inf1'],
    'about-neuron/arch/neuron-hardware/inferentia2': ['Inf2'],
    'about-neuron/arch/neuron-hardware/neuron-core-v1': ['Inf1'],
    'about-neuron/arch/neuron-hardware/neuron-core-v2': ['Inf2', 'Trn1'],
    'about-neuron/arch/neuron-hardware/neuron-core-v3': ['Trn2'],
    'about-neuron/arch/neuron-hardware/neuron-core-v4': ['Trn3'],
    'about-neuron/arch/neuron-hardware/trainium': ['Trn1'],
    'about-neuron/arch/neuron-hardware/trainium2': ['Trn2'],
    'about-neuron/arch/neuron-hardware/trainium3': ['Trn3'],
    'about-neuron/arch/neuron-hardware/trn1-arch': ['Trn1'],
    'about-neuron/arch/neuron-hardware/trn2-arch': ['Trn2'],
    'about-neuron/arch/neuron-hardware/trn3-arch': ['Trn3'],
}

# NxD Core training-specific pages (no Inf2)
NXD_CORE_TRAINING_PAGES = [
    'libraries/neuronx-distributed/index-training',
    'libraries/neuronx-distributed/developer-guide-training',
    'libraries/neuronx-distributed/api-reference-guide-training',
    'libraries/neuronx-distributed/tp_developer_guide',
    'libraries/neuronx-distributed/pp_developer_guide',
    'libraries/neuronx-distributed/ptl_developer_guide',
    'libraries/neuronx-distributed/save_load_developer_guide',
    'libraries/neuronx-distributed/activation_memory_reduction',
    'libraries/neuronx-distributed/activation_memory_reduction_developer_guide',
    'libraries/neuronx-distributed/standard_mixed_precision',
    'libraries/neuronx-distributed/tensor_parallelism_overview',
    'libraries/neuronx-distributed/pipeline_parallelism_overview',
    'libraries/neuronx-distributed/lora_finetune_developer_guide',
    'libraries/neuronx-distributed/model_optimizer_wrapper_developer_guide',
    'libraries/neuronx-distributed/context_parallelism_overview',
]


def _in_list(cur_file, file_list):
    """Return True if any entry in file_list is a substring of cur_file."""
    return any(entry in cur_file for entry in file_list)


def _splitall(path):
    """Split a path into all its components."""
    parts = []
    while True:
        head, tail = os.path.split(path)
        if head == path:
            parts.insert(0, head)
            break
        elif tail == path:
            parts.insert(0, tail)
            break
        else:
            path = head
            parts.insert(0, tail)
    return parts, len(parts)


def _get_explicit_override(cur_file):
    """Return (instances, True) if cur_file has an explicit CSV-based
    override, or (None, False) otherwise.

    Rules are evaluated top-to-bottom and return on first match, so more
    specific paths must be checked BEFORE broader ones. Within the
    neuronx-distributed block below, later assignments override earlier ones
    (last match wins).
    """
    # --- Libraries -----------------------------------------------------------
    # NxD Core = Inf2, Trn1, Trn2 (default for all neuronx-distributed pages)
    if cur_file.startswith('libraries/neuronx-distributed/'):
        result = ['Inf2', 'Trn1', 'Trn2']
        # Training-specific pages drop Inf2
        if cur_file in NXD_CORE_TRAINING_PAGES:
            result = ['Trn1', 'Trn2']
        if cur_file.startswith('libraries/neuronx-distributed/tutorials/training') or \
           cur_file.startswith('libraries/neuronx-distributed/tutorials/finetune'):
            result = ['Trn1', 'Trn2']
        return result, True
    if cur_file.startswith('libraries/transformers-neuronx/'):
        return ['Inf2', 'Trn1'], True
    if cur_file.startswith('libraries/nxd-training/'):
        return ['Trn1', 'Trn2'], True
    # vLLM must come before general nxd-inference
    if cur_file.startswith('libraries/nxd-inference/vllm/'):
        return ['Trn2', 'Trn3'], True
    if cur_file.startswith('libraries/nxd-inference/'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file.startswith('libraries/nemo-megatron/'):
        return ['Trn1', 'Trn2'], True
    # --- NKI -----------------------------------------------------------------
    if cur_file.startswith('nki/'):
        return ['Trn2', 'Trn3'], True
    # --- CustomOps -----------------------------------------------------------
    if cur_file.startswith('neuron-customops/'):
        return ['Inf2', 'Trn1'], True
    # --- Frameworks ----------------------------------------------------------
    if cur_file.startswith('frameworks/jax/'):
        return ['Trn2', 'Trn3'], True
    # TensorFlow NeuronX (must come before TensorFlow Neuron check)
    if 'tensorflow/tensorflow-neuronx' in cur_file:
        return ['Inf2', 'Trn1'], True
    # TensorFlow Neuron (Inf1)
    if 'tensorflow/tensorflow-neuron' in cur_file and 'neuronx' not in cur_file:
        return ['Inf1'], True
    # TorchNeuron native PyTorch (must come before torch-neuronx check)
    if 'torch/pytorch-native' in cur_file:
        return ['Trn2', 'Trn3'], True
    # PyTorch NeuronX (Torch/XLA)
    if 'torch/torch-neuronx' in cur_file:
        return ['Inf2', 'Trn1', 'Trn2'], True
    # PyTorch NeuronX top-level pages (not in torch-neuronx/ subdir)
    if cur_file in ['frameworks/torch/inference-torch-neuronx',
                    'frameworks/torch/training-torch-neuronx',
                    'frameworks/torch/training',
                    'frameworks/torch/inference']:
        return ['Inf2', 'Trn1', 'Trn2'], True
    # PyTorch Neuron (Inf1)
    if 'torch/torch-neuron' in cur_file and 'neuronx' not in cur_file:
        return ['Inf1'], True
    if cur_file == 'archive/torch-neuron/inference-torch-neuron':
        return ['Inf1'], True
    # MXNet
    if 'mxnet-neuron' in cur_file:
        return ['Inf1'], True
    # --- Neuron Runtime ------------------------------------------------------
    # Collectives (more specific, must come before the general runtime rule)
    if cur_file.startswith('neuron-runtime/about/collectives') or \
       cur_file in ['neuron-runtime/explore/internode-collective-comm',
                    'neuron-runtime/explore/intranode-collective-comm',
                    'neuron-runtime/explore/compute-comm-overlap']:
        return ['Trn1', 'Trn2', 'Trn3'], True
    if cur_file.startswith('neuron-runtime/'):
        return ['Inf2', 'Trn1', 'Trn2', 'Trn3'], True
    # --- Compiler ------------------------------------------------------------
    if cur_file.startswith('compiler/error-codes/'):
        return ['Inf2', 'Trn1', 'Trn2', 'Trn3'], True
    if cur_file == 'compiler/neuron-cc' or cur_file.startswith('compiler/neuron-cc/'):
        return ['Inf1'], True
    if cur_file == 'compiler/neuronx-cc' or cur_file.startswith('compiler/neuronx-cc/'):
        return ['Inf2', 'Trn1', 'Trn2', 'Trn3'], True
    if cur_file == 'neuron-customops/programming-guide' or cur_file.startswith('neuron-customops/programming-guide'):
        return ['Inf2', 'Trn1'], True
    # --- Setup ---------------------------------------------------------------
    if cur_file.startswith('setup/install-templates/inf1/'):
        return ['Inf1'], True
    if cur_file.startswith('setup/install-templates/inf2/'):
        return ['Inf2'], True
    if cur_file.startswith('setup/install-templates/trn1/') or \
       cur_file == 'setup/install-templates/launch-trn1-dlami':
        return ['Trn1'], True
    if cur_file in ['setup/setup-neuron', 'setup/torch-neuron', 'setup/torch-neuron-ubuntu20']:
        return ['Inf1'], True
    if cur_file.startswith('setup/neuron-setup/pytorch/neuronx/'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file.startswith('setup/neuron-setup/tensorflow/neuronx/'):
        return ['Inf2', 'Trn1'], True
    if cur_file.startswith('setup/neuron-setup/pytorch/neuron/'):
        return ['Inf1'], True
    if cur_file.startswith('setup/neuron-setup/tensorflow/neuron/'):
        return ['Inf1'], True
    if cur_file == 'setup/jax-neuronx':
        return ['Trn2', 'Trn3'], True
    if cur_file == 'setup/torch-neuronx':
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file == 'setup/tensorflow-neuronx':
        return ['Inf2', 'Trn1'], True
    if cur_file == 'setup/tensorflow-neuron':
        return ['Inf1'], True
    return None, False


def _get_page_override(cur_file):
    """Return (instances, True) for page-specific overrides that don't fit
    neatly into _get_explicit_override (devflows, containers, tools,
    about-neuron, etc.).
    """
    # --- Devflows ------------------------------------------------------------
    if cur_file == 'devflows/inference/byoc-hosting-devflow-inf2':
        return ['Inf2'], True
    if cur_file == 'devflows/inference/ec2-then-ec2-devflow-inf2':
        return ['Inf2'], True
    if cur_file == 'devflows/parallelcluster-flows':
        return ['Trn1', 'Trn2'], True
    if cur_file.startswith('devflows/training/batch/') or \
       cur_file.startswith('devflows/training/ec2/') or \
       cur_file.startswith('devflows/training/parallelcluster/') or \
       cur_file.startswith('devflows/training/sm-devflow/'):
        return ['Trn1', 'Trn2', 'Trn3'], True
    if cur_file.startswith('devflows/plugins/npd'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    # --- Containers ----------------------------------------------------------
    # OCI Hooks
    if 'tutorial-oci-hook' in cur_file:
        return ['Inf1', 'Inf2', 'Trn1', 'Trn2'], True
    # DRA
    if cur_file == 'containers/neuron-dra' or cur_file.startswith('containers/files/'):
        return ['Trn2', 'Trn3'], True
    if cur_file == 'containers/how-to/how-to-ultraserver':
        return ['Trn2', 'Trn3'], True
    # DLC quickstarts
    if cur_file == 'containers/get-started/quickstart-configure-deploy-dlc':
        return ['Trn2', 'Trn3'], True
    if cur_file == 'containers/get-started/quickstart-pytorch-inference-dlc':
        return ['Inf2', 'Trn1', 'Trn2', 'Trn3'], True
    # Inf1-era container content
    if cur_file == 'containers/tutorial-docker-runtime1.0':
        return ['Inf1'], True
    if cur_file == 'containers/container-deployment-flows' or \
       cur_file.startswith('containers/docker-example/inference/') or \
       cur_file.startswith('containers/docker-example/v1/') or \
       cur_file == 'containers/ec2-then-ec2-devflow' or \
       cur_file == 'containers/neo-then-hosting-devflow':
        return ['Inf1'], True
    # Container training/inference tutorials and docker examples
    if cur_file.startswith('containers/docker-example/training/'):
        return ['Trn1', 'Trn2', 'Trn3'], True
    if cur_file.startswith('containers/tutorials/inference/'):
        return ['Inf1'], True
    if cur_file.startswith('containers/tutorials/training/'):
        return ['Trn1', 'Trn2', 'Trn3'], True
    # Neuron Monitor Container
    if cur_file == 'containers/tutorials/k8s-neuron-monitor':
        return ['Inf2', 'Trn1', 'Trn2'], True
    # Node Problem Detector
    if cur_file.startswith('containers/tutorials/k8s-neuron-problem-detector'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    # --- Tools ---------------------------------------------------------------
    # TensorBoard plugin (End Of Support)
    if cur_file.startswith('tools/tensorboard/getting-started-tensorboard-neuronx') or \
       cur_file == 'tools/tutorials/tutorial-tensorboard-scalars-mnist' or \
       cur_file == 'tools/tutorials/torch-neuronx-profiling-with-tb':
        return ['Inf2', 'Trn1'], True
    # --- Announcements -------------------------------------------------------
    if cur_file.startswith('about-neuron/announcements/'):
        return [], True
    # --- Hardware architecture -----------------------------------------------
    if cur_file in HW_ARCH_MAP:
        return HW_ARCH_MAP[cur_file], True
    # --- Arch features -------------------------------------------------------
    if cur_file == 'about-neuron/arch/neuron-features/custom-c++-operators':
        return ['Inf2', 'Trn1'], True
    if cur_file == 'about-neuron/arch/neuron-features/logical-neuroncore-config':
        return ['Trn2', 'Trn3'], True
    # --- Appnotes ------------------------------------------------------------
    if cur_file == 'about-neuron/appnotes/neuronx-distributed/introducing-nxd-inference':
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file == 'about-neuron/appnotes/neuronx-distributed/introducing-nxdt-training':
        return ['Trn1', 'Trn2'], True
    if cur_file.startswith('about-neuron/appnotes/torch-neuronx/'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file.startswith('about-neuron/appnotes/transformers-neuronx/'):
        return ['Inf2', 'Trn1'], True
    if cur_file == 'about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision':
        return ['Trn1', 'Trn2', 'Trn3'], True
    if cur_file.startswith('about-neuron/appnotes/neuron1x/'):
        return ['Inf1'], True
    # --- Benchmarks ----------------------------------------------------------
    if cur_file == 'about-neuron/benchmarks/index':
        return ['Inf1', 'Inf2', 'Trn1', 'Trn2', 'Trn3'], True
    # --- Quick-start ---------------------------------------------------------
    if cur_file == 'about-neuron/quick-start/tensorflow-neuron':
        return ['Inf1'], True
    if cur_file in ['about-neuron/quick-start/torch-neuron',
                    'about-neuron/quick-start/torch-neuron-tab-training']:
        return ['Inf1'], True
    if cur_file.startswith('about-neuron/quick-start/tab-inference-torch-neuronx'):
        return ['Inf2', 'Trn1', 'Trn2'], True
    if cur_file.startswith('about-neuron/quick-start/tab-inference-torch-neuron') and 'neuronx' not in cur_file:
        return ['Inf1'], True
    if cur_file.startswith('about-neuron/quick-start/tab-inference-tensorflow-neuronx'):
        return ['Inf2', 'Trn1'], True
    if cur_file.startswith('about-neuron/quick-start/tab-inference-tensorflow-neuron') and 'neuronx' not in cur_file:
        return ['Inf1'], True
    return None, False


class NeuronTag(SphinxDirective):

    def run(self):
        cur_file = self.env.docname
        path_split, path_len = _splitall(cur_file)

        # Landing page gets no tag
        if path_split[0] == 'index':
            return self._render('')

        # Step 1: Assign default instances based on top-level directory
        return_instances = []
        if path_split[0] in NEURON1_DIRS:
            return_instances = ['Inf1']
        elif path_split[0] in COMMON_DIRS:
            return_instances = ['Inf1', 'Inf2', 'Trn1', 'Trn2', 'Trn3']

        # Step 2: Check explicit overrides (CSV-based, highest priority)
        explicit_override = False
        result, matched = _get_explicit_override(cur_file)
        if matched:
            return_instances = result
            explicit_override = True
        if not explicit_override:
            result, matched = _get_page_override(cur_file)
            if matched:
                return_instances = result
                explicit_override = True

        # Step 3: Directory-based inference/training heuristic
        if not explicit_override:
            if path_len >= 2:
                parent_dir = path_split[path_len - 2]
                if parent_dir == 'inference':
                    return_instances = ['Inf1']
                elif parent_dir == 'training':
                    return_instances = ['Trn1', 'Trn2', 'Trn3']

        # Step 4: Legacy add/clear tag lists (only for non-overridden files)
        if not explicit_override:
            if _in_list(cur_file, add_trn1_tag):
                if 'Trn1' not in return_instances:
                    return_instances.extend(['Trn1', 'Trn2', 'Trn3', 'Inf2'])
            if _in_list(cur_file, add_trn2_tag):
                if 'Trn2' not in return_instances:
                    return_instances.extend(['Trn2', 'Trn3'])
            if _in_list(cur_file, add_trn3_tag):
                if 'Trn3' not in return_instances:
                    return_instances.append('Trn3')
            if _in_list(cur_file, add_neuronx_tag):
                if 'Trn1' not in return_instances:
                    return_instances.extend(['Trn1', 'Trn2', 'Trn3', 'Inf2'])
            if _in_list(cur_file, add_inf1_tag):
                if 'Inf1' not in return_instances:
                    return_instances.append('Inf1')
            if _in_list(cur_file, clear_nc_v2_tag):
                for tag in ['Trn1', 'Trn2', 'Trn3', 'Inf2']:
                    if tag in return_instances:
                        return_instances.remove(tag)
            if _in_list(cur_file, clear_trn1_tag):
                if 'Trn1' in return_instances:
                    return_instances.remove('Trn1')
            if _in_list(cur_file, clear_trn2_tag):
                if 'Trn2' in return_instances:
                    return_instances.remove('Trn2')
            if _in_list(cur_file, clear_trn3_tag):
                if 'Trn3' in return_instances:
                    return_instances.remove('Trn3')
            if _in_list(cur_file, clear_inf1_tag):
                if 'Inf1' in return_instances:
                    return_instances.remove('Inf1')
            if _in_list(cur_file, clear_inf2_tag):
                if 'Inf2' in return_instances:
                    return_instances.remove('Inf2')

        # Step 5: Generate output
        return_instances = sorted(set(return_instances))
        if return_instances:
            text = TEXT_TEMPLATE + ', '.join('``' + i + '``' for i in return_instances)
        else:
            text = ''
        return self._render(text)

    def _render(self, text):
        """Parse RST text and return docutils nodes."""
        rst = ViewList()
        rst.append(text, "neuron-tag", 1)
        node = nodes.section()
        node.document = self.state.document
        nested_parse_with_titles(self.state, rst, node)
        return node.children


def setup(app):
    app.add_directive("neuron-tag", NeuronTag)
    return {
        'version': '0.2',
        'parallel_read_safe': True,
        'parallel_write_safe': True,
    }
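Because most rules return on their first match, the relative order of the prefix checks is load-bearing. A quick, hypothetical spot-check of the precedence, assuming the `_ext` directory is on `sys.path` so `neuron_tag` is importable:

```python
# Hypothetical spot-check of override precedence for a few docnames.
from neuron_tag import _get_explicit_override, _get_page_override

# The narrower vLLM rule fires before the general nxd-inference rule:
print(_get_explicit_override('libraries/nxd-inference/vllm/quickstart'))
# -> (['Trn2', 'Trn3'], True)
print(_get_explicit_override('libraries/nxd-inference/overview'))
# -> (['Inf2', 'Trn1', 'Trn2'], True)

# Announcements are not explicit overrides; the page override clears all tags:
print(_get_explicit_override('about-neuron/announcements/index'))
# -> (None, False)
print(_get_page_override('about-neuron/announcements/index'))
# -> ([], True)
```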
NFR1: Performance - Action completes review within 5 minutes for typical PRs (1-5 files) - Action processes files in parallel when possible #### NFR2: Security - Action uses GitHub secrets for Q CLI credentials - Action has read-only access to repository - Action has write access only to PR comments #### NFR3: Maintainability - Action configuration is version controlled in `.github/workflows/` - Action uses official Q CLI container/action when available - Action logic is simple and well-documented ## User Stories ### US1: Automatic Review Trigger **As a** documentation contributor **I want** the review action to run automatically when I label my PR **So that** I get immediate feedback without manual intervention **Acceptance Criteria:** - Action triggers when "release-notes" label is added - Action runs on subsequent commits to labeled PR - Action does not run on PRs without the label ### US2: Targeted File Review **As a** documentation contributor **I want** only my changed release notes files to be reviewed **So that** I get relevant feedback without noise from unchanged files **Acceptance Criteria:** - Only files in `/release-notes/components/*.rst` are reviewed - Only files modified in the PR are analyzed - Files in other directories are ignored ### US3: Clear Feedback **As a** documentation contributor **I want** clear, actionable feedback on my release notes **So that** I know exactly what to improve **Acceptance Criteria:** - Feedback follows the format specified in guidelines - Each issue includes: original text, problem, example rewrite, action items - Feedback is posted as a PR comment - Comment includes link to full guidelines ### US4: No False Failures **As a** documentation contributor **I want** the action to provide feedback without blocking my PR **So that** I can address issues without being blocked by automation **Acceptance Criteria:** - Action never fails the PR check - Action always succeeds even if issues are found - Issues are reported as comments, not check failures ## Technical Design ### GitHub Action Workflow **File Location:** `.github/workflows/release-notes-review.yml` **Trigger Events:** ```yaml on: pull_request: types: [opened, synchronize, labeled] paths: - 'release-notes/components/**/*.rst' ``` **Workflow Steps:** 1. **Check Label** - Verify PR has "release-notes" label - Exit gracefully if label not present 2. **Get Changed Files** - Use GitHub API to get list of changed files - Filter for `release-notes/components/**/*.rst` - Exit if no matching files found 3. **Setup Q CLI** - Install/configure Amazon Q CLI - Authenticate using GitHub secrets 4. **Load Guidelines** - Read `_ext/release-notes-context.md` - Prepare as context for Q CLI 5. **Review Each File** - For each changed RST file: - Read file content - Invoke Q CLI with prompt: ``` Review the following release notes file against the guidelines provided. Guidelines: [content from release-notes-context.md] File: [filename] Content: [file content] Provide feedback using the review format specified in the guidelines. Focus on: customer visibility, documentation links, impact clarity, specific conditions, and actionable information. ``` - Capture Q CLI response 6. **Format Feedback** - Combine all file reviews into single comment - Format as markdown with sections per file - Include summary at top 7. **Post Comment** - Post formatted feedback as PR comment - Include link to guidelines - Tag PR author ### Q CLI Prompt Template ```markdown You are reviewing release notes for the AWS Neuron SDK. 
Review the following file against the release notes writing guidelines. GUIDELINES: [Full content of _ext/release-notes-context.md] FILE TO REVIEW: {filename} CONTENT: {file_content} INSTRUCTIONS: 1. Review the content against all guidelines 2. Identify issues using the review format from the guidelines 3. For each issue, provide: - Issue number and title - Original text - Problem description - Phrasing problem (if applicable) - Example rewrite - Specific action items 4. If no issues found, state "No issues found - release notes meet guidelines" Focus especially on: - Customer-visible language (no internal code names) - Documentation URLs for all new features - Specific conditions (not vague language) - Clear impact statements - Proper categorization (breaking changes vs bug fixes) - Migration guidance for breaking changes ``` ### Comment Format Template ```markdown ## 🤖 Release Notes Review This PR modifies {count} release notes file(s). Here's the automated review: ### Files Reviewed - ✅ `release-notes/components/file1.rst` - {issue_count} issue(s) - ✅ `release-notes/components/file2.rst` - No issues found --- ### 📝 Review Feedback #### File: `release-notes/components/file1.rst` [Q CLI feedback for file1] --- #### File: `release-notes/components/file2.rst` [Q CLI feedback for file2] --- ### 📚 Resources - [Release Notes Writing Guidelines](_ext/release-notes-context.md) - Need help? Tag @documentation-team --- *This is an automated review. Please address the feedback and request human review when ready.* ``` ## Implementation Notes ### GitHub Action Configuration **Required Secrets:** - `Q_CLI_TOKEN` or equivalent for Q CLI authentication **Required Permissions:** ```yaml permissions: contents: read pull-requests: write ``` **Environment:** - Ubuntu latest runner - Node.js 18+ (if using JavaScript action) - Python 3.9+ (if using Python script) ### Q CLI Integration Options **Option 1: Direct CLI Invocation** ```bash q chat --prompt-file prompt.txt --context-file guidelines.md ``` **Option 2: Q CLI GitHub Action** (if available) ```yaml - uses: aws/q-cli-action@v1 with: prompt: ${{ steps.prepare.outputs.prompt }} context: ${{ steps.prepare.outputs.context }} ``` **Option 3: API Integration** (if Q provides API) ```python import q_cli response = q_cli.chat(prompt=prompt, context=guidelines) ``` ## Testing Strategy ### Unit Tests - Test file filtering logic - Test prompt generation - Test comment formatting ### Integration Tests - Test with sample PR containing valid release notes - Test with sample PR containing issues - Test with PR without "release-notes" label - Test with PR modifying non-component files ### Manual Testing - Create test PR with intentional issues - Verify action triggers correctly - Verify feedback is accurate and helpful - Verify comment formatting is readable ## Success Criteria 1. **Automation Works**: Action runs on 100% of labeled PRs 2. **Accurate Detection**: Action correctly identifies changed RST files 3. **Useful Feedback**: 80%+ of PR authors find feedback helpful 4. **No False Blocks**: Action never blocks valid PRs 5. **Performance**: Action completes within 5 minutes 6. 
**Reliability**: Action succeeds 95%+ of the time ## Future Enhancements ### Phase 2 (Optional) - Support for reviewing other release notes files (not just components) - Severity levels for issues (critical, warning, suggestion) - Auto-fix suggestions as code suggestions - Integration with PR review status - Metrics dashboard for common issues ### Phase 3 (Optional) - Pre-commit hook for local review - VS Code extension for real-time feedback - Training mode to help new contributors learn guidelines - Historical analysis of release notes quality trends ## Dependencies - GitHub Actions infrastructure - Amazon Q CLI availability and access - Repository write access for bot account - `_ext/release-notes-context.md` guidelines file ## Risks and Mitigations | Risk | Impact | Mitigation | |------|--------|------------| | Q CLI unavailable | High | Graceful failure with manual review fallback | | Q CLI rate limits | Medium | Implement retry logic and rate limiting | | False positives | Medium | Continuous refinement of guidelines and prompts | | Action performance | Low | Parallel processing and caching | | Cost of Q CLI usage | Low | Monitor usage and set budget alerts | ## Rollout Plan 1. **Phase 1**: Implement basic action with manual trigger 2. **Phase 2**: Enable automatic trigger on label 3. **Phase 3**: Gather feedback and refine prompts 4. **Phase 4**: Expand to other release notes files if successful ## Maintenance - **Owner**: Documentation team - **Review Frequency**: Quarterly - **Update Triggers**: - Changes to release notes guidelines - Q CLI updates - User feedback on accuracy - GitHub Actions platform changes ================================================ FILE: _ext/release-notes-context.md ================================================ # Release Notes Writing Guidelines ## Core Principles ### Answer Three Questions for Every Item - **What?** — What feature/API is affected? - **When?** — Under what conditions does this occur? - **So what?** — What is the impact on the user? ### All Content Must Be: - **Customer-visible** - Written from the customer's perspective about capabilities they can use - **Documented** - If documentation doesn't exist, exclude the feature. All new features must include documentation URLs. 
- **Actionable** - Include workarounds, timelines, or how to check if affected ## DO: - **Write in customer-visible terms** - Describe what customers can now do, not how it was implemented - **State the impact clearly** - Use concrete language about what happens to users - **Be specific about conditions** - Replace vague phrases with precise conditions - **Quantify performance improvements** - Provide specific before/after metrics (e.g., "improved from 2.164x to 3.654x speedup") and state the conditions that trigger these improvements (e.g., "for batch I/O operations with 1024 ops at 10KB") - **Explain the impact of wrong defaults** - When fixing incorrect default values, state what the wrong default was and what impact it had on users - **Specify what was missing** - When fixing "missing" items, list what was missing and confirm they are now documented - **Describe previous behavior for bugs** - Always explain what the incorrect behavior was before the fix - **Categorize breaking changes correctly** - If a bug fix changes API behavior (e.g., renaming a parameter), list it under Breaking Changes, not Bug Fixes - **Provide actionable information** - Include workarounds if available, fix timelines if known, or how users can check if they're affected - **Provide migration guidance for breaking changes** - Tell users what they should do when behavior changes, with before/after examples - **Link to documentation** - Every feature must have corresponding documentation with URL - **Include documentation URLs for all new features** - If no URL exists, either create documentation first or remove the feature from release notes - **Use standard terminology** - Use terms your audience already knows - **Use clear, descriptive sentences** - Transform technical phrases into customer-understandable language - **Focus on customer-visible results** - Describe what customers will see, not internal mechanics - **Drop unnecessary words** - Remove "when specified," "may," "is in progress" when they add no value - **Remove empty sections** - Don't include placeholder text like "None in this release" - **Verify accuracy** - Check version numbers, dates, and technical details - **Run IP scanner** - Catch any internal code name leaks before publishing - **Use active voice** - Write "The system ignores the parameter" instead of "The parameter is ignored" - **Define abbreviations on first use** - Write "time to first token (TTFT)" before using "TTFT" - **Remove temporal qualifiers** - Replace "for now" with specific timelines or remove entirely - **Provide concrete examples** - Include calculation examples for complex parameters ## DO NOT: - **Include internal code names** - Remove references like "TRN3PDS", "Mariana", "Penguin" - **Document undocumented features** - If documentation doesn't exist, exclude the feature - **Include features without documentation URLs** - Every new feature must have a documentation link - **List unreleased features** - Only include features available to customers - **Include internal-only metrics** - Remove metrics useful only internally - **Document bugs never released** - Only include fixes for publicly released issues - **Use internal API names** - Unless they're part of the public API - **Include debug variables** - Remove environment variables meant only for internal use - **Use vague language** - Avoid "in certain cases," "some patterns," "may sometimes" - **Use ambiguous phrasing** - Avoid phrases like "Fixed dynamic for loop" that could mean multiple things - **Leave impacts 
unexplained** - Don't just say "fixed wrong default" without explaining what the impact was - **Mix breaking changes with bug fixes** - Parameter renames or behavior changes belong in Breaking Changes, not Bug Fixes - **Create heavy noun chains** - Break up complex phrases (e.g., "dtype override was ignored during reshape" not "reshape dtype override not being applied") - **Write without context** - Every change needs metrics, conditions, or migration guidance - **Use hedging language** - Replace "may result in" with "results in" when deterministic - **Focus on internal implementation** - Avoid phrases like "internally uses" or internal platform identifiers - **Use passive voice without clear subject** - Avoid constructions where the actor is unclear - **Reference undefined versions** - Don't use "V0" or "V1" without defining them ## Impact Statements | Avoid | Prefer | |-------|--------| | "incorrectly interpret" | "produces incorrect results" | | "not being applied" | "is ignored" | | "failing check" | "crashes with validation error" | | "may incorrectly interpret tensor shapes" | "can produce incorrect results when transposing tensors" | ## Conditions - Be Specific | Avoid | Prefer | |-------|--------| | "in certain cases" | "when reduction axis is not the last dimension" | | "some patterns" | "multi-dimensional transposes with more than 2 axes" | | "may sometimes" | "consistently occurs when..." | | "for now" | "Support is planned for version X.X.X" or remove entirely | | "small inputs" | "inputs under 512 tokens" | | "low batch sizes" | "batch sizes of 4 or less" | ## Phrasing Examples ### Bug Fixes: | Avoid | Prefer | |-------|--------| | "Fixed bug in nrt_vnc_usage_find_internal" | "Improved error handling to return a clear error instead of asserting during nrt_init" | | "Fixed dynamic for loop incorrectly incrementing the loop induction variable" | "Fixed: dynamic for loops now correctly increment the loop counter. Previously, the counter incremented incorrectly, causing [specific impact]" | | "Fixed reshape dtype override not being applied when specified" | "Fixed a bug where specifying a data type override during a reshape operation was ignored" | | "Fixed reshape of shared/private HBM tensors failing partition size check" | "Fixed a bug where reshaping tensors stored in shared or private HBM incorrectly failed the partition size check" | | "Fixed incorrect default value for on_false_value" | "Fixed incorrect default value for on_false_value in nki.isa.range_select. Previously defaulted to [X], now correctly defaults to [Y], which [impact]" | ### Performance Improvements: | Avoid | Prefer | |-------|--------| | "Optimized zero-copy operations by enabling descriptor merging" | "Enhanced zero-copy operation performance: Write performance improved from 2.164x to 3.654x speedup for batch I/O operations(1_Batch_1024_Ops_10_KBs)" | | "Optimized mesh AllGather on TP8 configurations using destination routing" | "Optimized mesh AllGather: [X]% performance improvement on TP8 configurations when [specific conditions]" | ### New Features: | Avoid | Prefer | |-------|--------| | "Added support for TRN3PDS platform" | "Added support for [public instance type name] with optimized topology configurations for distributed training. See [documentation URL]" | | "Added IOCTL to lookup Neuron device/HBM for a given virtual address" | "Added capability to lookup Neuron device for a given virtual address, enabling frameworks to identify which device holds a tensor. 
See [documentation link] for API details" | ### Known Issues: | Avoid | Prefer | |-------|--------| | "may incorrectly interpret tensor shapes in certain multi-dimensional transpose patterns" | "can produce incorrect results when transposing tensors with certain multi-dimensional shapes" | | "Training, Inference, and Penguin kernels compilation and execution validation is in progress" | Remove entirely (internal project name and not customer-actionable) | | "Chunked prefill is not supported on Neuron for now" | "Chunked prefill is not supported. If you attempt to enable it with DISABLE_NEURON_CUSTOM_SCHEDULER='1', the system will fail to start with an error. Use standard prefill mode instead." | ## Breaking Changes Checklist When documenting breaking changes, always include: 1. **What changed** - The specific API, parameter, or behavior 2. **Why it's breaking** - What will stop working 3. **Migration path** - What users should do instead 4. **Example (if helpful)** - Show old vs. new usage ### Example: **Breaking:** NumPy synonyms (e.g., `np.add` for `nl.add`) are no longer accepted in NKI API calls. **Migration:** Replace all NumPy function calls with their NKI equivalents: - Replace `np.add(x, y)` with `nl.add(x, y)` - Replace `np.multiply(x, y)` with `nl.multiply(x, y)` Always explain: - Why is this breaking? - What was the previous behavior? - What is the workaround or migration effort? ## Quick Template ``` [Fixed/Known Issue]: [API/Feature] [impact] when [specific conditions]. [Optional: Workaround or timeline.] ``` ### Example: ``` Fixed: nki.isa.dma_copy causes a runtime timeout when copying FP32 from SBUF to BF16 in HBM with indirect addressing. Workaround: cast to BF16 in SBUF before copying. ``` ## Quality Checks Before Publishing 1. **No internal names** - Run IP scanner to catch code name leaks 2. **Customer value** - Each item explains why customers should care 3. **Documentation links** - New features link to relevant docs with URLs 4. **Documentation exists** - Verify all features are documented before including; if no documentation URL exists, remove the feature from release notes 5. **Accuracy** - Technical details are correct and verifiable 6. **Clarity** - Phrasing is clear and professional 7. **Completeness** - Previous behavior and migration paths explained 8. **Impact explained** - Bug fixes describe what was broken and what the impact was 9. **Active voice** - Sentences use active voice with clear subjects 10. **Abbreviations defined** - All abbreviations spelled out on first use 11. **No vague language** - All conditions and impacts are specific and quantified 12. **Examples provided** - Complex parameters include calculation examples ## Key Principles ### All content must be: - **Customer-visible** (not internal implementation details) - **Documented with URLs** (if docs don't exist, exclude it) - **Impactful** (explain value, not just what changed) ### Every bug fix must answer: - What was broken? - What was the impact? - What works now? 
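A skeleton that makes those three answers explicit (bracketed fields are placeholders to fill in):

```
Fixed: [API/feature] now [what works]. Previously, [what was broken], which caused [impact on users] when [specific conditions].
```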
### Every new feature must include: - Documentation URL - Customer benefit - Usage guidance or examples ## How to Review Release Notes When reviewing release notes against these guidelines, provide feedback in the following format: ### Issue [Number]: [Brief Issue Title] **Original Text:** ``` [Exact text from the release notes] ``` **Problem:** [Description of the content/completeness issue] **Phrasing Problem:** [Description of the language/clarity issue, if applicable] **Example Rewrite:** ``` [Suggested improved version showing correct phrasing and content] ``` **Action:** - [Specific action item 1] - [Specific action item 2] ## Review Process: 1. **Extract original text** - Include the exact text being reviewed 2. **Identify problems** - Separate content issues from phrasing issues 3. **Provide examples** - Show how to rewrite the text correctly 4. **List actions** - Give specific, actionable steps to fix each issue 5. **Check documentation** - Verify URLs exist for all new features; if not, recommend removal 6. **Verify completeness** - Ensure all three questions (What? When? So what?) are answered 7. **Check phrasing** - Identify vague language, passive voice, undefined terms, internal references 8. **Validate breaking changes** - Ensure migration guidance and before/after examples are included ================================================ FILE: _ext/sphinx_plotly_directive.py ================================================ """ CODE FROM: https://github.com/harupy/sphinx-plotly-directive LICENSE: MIT Based on: https://matplotlib.org/3.1.3/devel/plot_directive.html A directive for including a Plotly figure in a Sphinx document ================================================================ By default, in HTML output, `plot` will include a .png file with a link to a high-res .png and .pdf. In LaTeX output, it will include a .pdf. The source code for the plot may be included in one of three ways: 1. **A path to a source file** as the argument to the directive:: .. plot:: path/to/plot.py When a path to a source file is given, the content of the directive may optionally contain a caption for the plot:: .. plot:: path/to/plot.py The plot's caption. Additionally, one may specify the name of a function to call (with no arguments) immediately after importing the module:: .. plot:: path/to/plot.py plot_function1 2. Included as **inline content** to the directive:: .. plotly:: import plotly.express as px px.scatter(x=[0, 1, 2, 3, 4], y=[0, 1, 4, 9, 16]) 3. Using **doctest** syntax:: .. plotly:: A plotting example: >>> import plotly.express as px >>> px.scatter(x=[0, 1, 2, 3, 4], y=[0, 1, 4, 9, 16]) 4. Using the `fig-vars` option. In the example below, `fig1` and `fig2` will be rendered:: .. plotly:: :fig-vars: fig1, fig2 import plotly.express as px fig1 = px.scatter(x=[0, 1, 2, 3, 4], y=[0, 1, 4, 9, 16]) fig2 = px.scatter(x=[4, 3, 2, 1, 0], y=[0, 1, 4, 9, 16]) Options ------- The ``plotly`` directive supports the following options: format : {'python', 'doctest'} The format of the input. include-source : bool Whether to display the source code. The default can be changed using the `plot_include_source` variable in :file:`conf.py`. encoding : str If this source file is in a non-UTF8 or non-ASCII encoding, the encoding must be specified using the ``:encoding:`` option. The encoding will not be inferred using the ``-*- coding -*-`` metacomment. 
context : bool or str If provided, the code will be run in the context of all previous plot directives for which the ``:context:`` option was specified. This only applies to inline code plot directives, not those run from files. If the ``:context: reset`` option is specified, the context is reset for this and future plots, and previous figures are closed prior to running the code. ``:context: close-figs`` keeps the context but closes previous figures before running the code. nofigs : bool If specified, the code block will be run, but no figures will be inserted. This is usually useful with the ``:context:`` option. caption : str If specified, the option's argument will be used as a caption for the figure. This overwrites the caption given in the content, when the plot is generated from a file. iframe-width The width of the iframe in which a plotly figure is rendered. The default can be changed using the `plotly_iframe_width` variable in :file:`conf.py`. iframe-height The height of the iframe in which a plotly figure is rendered. The default can be changed using the `plotly_iframe_height` variable in :file:`conf.py`. Additionally, this directive supports all of the options of the `image` directive, except for *target* (since plot will add its own target). These include *alt*, *height*, *width*, *scale*, *align* and *class*. Configuration options --------------------- The plot directive has the following configuration options: plotly_include_source Default value for the include-source option plotly_html_show_source_link Whether to show a link to the source in HTML. plotly_pre_code Code that should be executed before each plot. If not specified or None it will default to a string containing:: import numpy as np import plotly import plotly.graph_objects as go import plotly.express as px plotly_basedir Base directory to which ``plot::`` file names are relative. (If None or empty, file names are relative to the directory where the file containing the directive is.) plotly_formats File formats to generate. List of tuples or strings:: [(suffix, dpi), suffix, ...] that determine the file format and the DPI. For entries whose DPI was omitted, sensible defaults are chosen. When passing from the command line through sphinx-build the list should be passed as suffix:dpi,suffix:dpi, ... plotly_html_show_formats Whether to show links to the files in HTML. plotly_working_directory By default, the working directory will be changed to the directory of the example, so the code can get at its data files, if any. Also its path will be added to `sys.path` so it can import any helper modules sitting beside it. This configuration option can be used to specify a central directory (also added to `sys.path`) where data files and helper modules for all code are located. plotly_iframe_width The width of the iframe in which a plotly figure is rendered. The default is "100%". plotly_iframe_height The height of the iframe in which a plotly figure is rendered. The default is "500px". plotly_template Provide a customized template for preparing restructured text. """ import copy import itertools import os import re import shutil import textwrap import traceback from os.path import relpath from pathlib import Path import jinja2 # Sphinx dependency. from docutils.parsers.rst import Directive, directives from docutils.parsers.rst.directives.images import Image import plotly INDENT_SPACES = " " * 3 def save_plotly_figure(fig, path): r""" Save a Plotly figure. 
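The figure is serialized with ``plotly.offline.plot`` as an HTML ``div`` that loads plotly.js from the CDN, so the saved file renders standalone in a browser with network access.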
Parameters ---------- fig : plotly figure A plotly figure to save. path : str A file path. Returns ------- None Examples -------- >>> import plotly.express as px >>> import tempfile >>> fig = px.scatter(x=[0, 1, 2, 3, 4], y=[0, 1, 4, 9, 16]) >>> path = tempfile.NamedTemporaryFile(suffix=".html").name >>> save_plotly_figure(fig, path) """ fig_html = plotly.offline.plot(fig, output_type="div", include_plotlyjs="cdn", auto_open=False) with open(path, "w") as f: f.write(fig_html) def assign_last_line_into_variable(code, variable_name): r""" Assign the last non-empty line of the given code to a variable. Parameters ---------- code : str A string representing code. variable_name : str A variable name. Returns ------- str New code. Examples -------- >>> code = "a = 1\nfunc(a)" >>> new_code = assign_last_line_into_variable(code, "b") >>> print(new_code) a = 1 b = func(a) """ lines = code.split("\n") for idx in range(len(lines) - 1, -1, -1): if lines[idx].strip() != "": lines[idx] = "{} = ".format(variable_name) + lines[idx] break return "\n".join(lines) def create_directive_block(name, arguments, options, content): r""" Create a directive block. Parameters ---------- name : str A directive name. arguments : list of str Arguments of the directive. options : dict Options of the directive. content : list of str Content of the directive. Returns ------- str A directive block. Examples -------- >>> block = create_directive_block( ... "plotly", ... ["f1", "f2"], ... {"a": 0, "b": 1}, ... ["l1", "l2"], ... ) >>> print(block) .. plotly:: f1 f2 :a: 0 :b: 1 l1 l2 """ header = ".. {}:: ".format(name) + " ".join(arguments) code = "\n".join(map(str, content)) lines = [header] if len(options.items()) > 0: def process_value(v): if isinstance(v, list): return ", ".join(v) return v options_block = "\n".join(":{}: {}".format(k, process_value(v)) for k, v in options.items()) lines.append(textwrap.indent(options_block, INDENT_SPACES)) lines.append("") lines.append(textwrap.indent(code, INDENT_SPACES)) return "\n".join(lines) def create_code_block(code, language=None): return "\n".join( [ ".. 
code-block::{}".format(" " + language if language else ""), "", textwrap.indent(code.strip(), INDENT_SPACES), "", ] ) def strip_last_line(code): r""" Strips the last line of the give code block Parameters ---------- code : str Code to strip Returns ------- str: Stripped code Examples -------- >>> strip_last_line("a") '' >>> strip_last_line("a\nb") 'a' >>> strip_last_line("a\nb\nc") 'a\nb' """ return "\n".join(code.strip().split("\n")[:-1]) def ends_with_show(code): r""" Returns True if the last line of the given code block ends with `show()` Parameters ---------- code : str Code that may contain a line that looks like `fig.show()` Returns ------- str: Variable name of the object that calls `show()` Examples -------- >>> ends_with_show("fig.show()") # simple True >>> ends_with_show("fig.show(1, a=2)") # show with arguments True >>> ends_with_show("fig = dummy\nfig.show()\n") # multiline True >>> ends_with_show("foo") # doesn't contains `show` False """ # TODO: Use a more strict regular expression pattern = r"^(.+)\.show\(.*\)$" match = re.search(pattern, code.strip().split("\n")[-1], flags=re.DOTALL) return bool(match) # ----------------------------------------------------------------------------- # Registration hook # ----------------------------------------------------------------------------- def _option_boolean(arg): if not arg or not arg.strip(): # no argument given, assume used as a flag return True elif arg.strip().lower() in ("no", "0", "false"): return False elif arg.strip().lower() in ("yes", "1", "true"): return True else: raise ValueError('"%s" unknown boolean' % arg) def _option_context(arg): if arg in [None, "reset", "close-figs"]: return arg raise ValueError("Argument should be None or 'reset' or 'close-figs'") def _option_format(arg): return directives.choice(arg, ("python", "doctest")) def _option_fig_vars(arg): return [x.strip() for x in arg.split(",")] def mark_plot_labels(app, document): """ To make plots referenceable, we need to move the reference from the "htmlonly" (or "latexonly") node to the actual figure node itself. """ for name, explicit in document.nametypes.items(): if not explicit: continue labelid = document.nameids[name] if labelid is None: continue node = document.ids[labelid] if node.tagname in ("html_only", "latex_only"): for n in node: if n.tagname == "figure": sectname = name for c in n: if c.tagname == "caption": sectname = c.astext() break node["ids"].remove(labelid) node["names"].remove(name) n["ids"].append(labelid) n["names"].append(name) document.settings.env.labels[name] = ( document.settings.env.docname, labelid, sectname, ) break class PlotlyDirective(Directive): """The ``.. 
plotly::`` directive, as documented in the module's docstring.""" has_content = True required_arguments = 0 optional_arguments = 2 final_argument_whitespace = False option_spec = { "alt": directives.unchanged, "height": directives.length_or_unitless, "width": directives.length_or_percentage_or_unitless, "scale": directives.nonnegative_int, "align": Image.align, "class": directives.class_option, "include-source": _option_boolean, "format": _option_format, "context": _option_context, "nofigs": directives.flag, "encoding": directives.encoding, "caption": directives.unchanged, "fig-vars": _option_fig_vars, "iframe-width": directives.unchanged, "iframe-height": directives.unchanged, } def run(self): """Run the plot directive.""" try: return run( self.arguments, self.content, self.options, self.state_machine, self.state, self.lineno, ) except Exception as e: raise self.error(str(e)) def setup(app): setup.app = app setup.config = app.config setup.confdir = app.confdir app.add_directive("plotly", PlotlyDirective) app.add_config_value("plotly_pre_code", None, True) app.add_config_value("plotly_include_source", False, True) app.add_config_value("plotly_html_show_source_link", True, True) app.add_config_value("plotly_formats", ["html"], True) app.add_config_value("plotly_basedir", None, True) app.add_config_value("plotly_html_show_formats", True, True) app.add_config_value("plotly_working_directory", None, True) app.add_config_value("plotly_iframe_width", "100%", True) app.add_config_value("plotly_iframe_height", "500px", True) app.add_config_value("plotly_template", None, True) app.add_config_value("plotly_include_directive_source", None, False) app.connect("doctree-read", mark_plot_labels) metadata = { "parallel_read_safe": True, "parallel_write_safe": True, "version": 0.1, } return metadata # ----------------------------------------------------------------------------- # Doctest handling # ----------------------------------------------------------------------------- def contains_doctest(text): try: # check if it's valid Python as-is compile(text, "", "exec") return False except SyntaxError: pass r = re.compile(r"^\s*>>>", re.M) m = r.search(text) return bool(m) def unescape_doctest(text): """ Extract code from a piece of text, which contains either Python code or doctests. """ if not contains_doctest(text): return text code = "" for line in text.split("\n"): m = re.match(r"^\s*(>>>|\.\.\.) (.*)$", line) if m: code += m.group(2) + "\n" elif line.strip(): code += "# " + line.strip() + "\n" else: code += "\n" return code def split_code_at_show(text): """Split code at plt.show().""" parts = [] is_doctest = contains_doctest(text) part = [] for line in text.split("\n"): if (not is_doctest and line.strip() == "plt.show()") or ( is_doctest and line.strip() == ">>> plt.show()" ): part.append(line) parts.append("\n".join(part)) part = [] else: part.append(line) if "\n".join(part).strip(): parts.append("\n".join(part)) return parts # ----------------------------------------------------------------------------- # Template # ----------------------------------------------------------------------------- TEMPLATE = """ {% if directive_source %} Source: {{ directive_source }} Output: {% endif %} {{ source_code }} .. 
only:: html {% if source_link or (html_show_formats and not multi_image) %} ( {%- if source_link -%} `Source code <{{ source_link }}>`__ {%- endif -%} {%- if html_show_formats and not multi_image -%} {%- for fig in figures -%} {%- for fmt in fig.formats -%} {%- if source_link or not loop.first -%}, {% endif -%} `{{ fmt }} <{{ dest_dir }}/{{ fig.basename }}.{{ fmt }}>`__ {%- endfor -%} {%- endfor -%} {%- endif -%} ) {% endif %} {% for fig in figures %} .. raw:: html {% for option in options -%} {{ option }} {% endfor %} {% if html_show_formats and multi_figure -%} ( {%- for fmt in fig.formats -%} {%- if not loop.first -%}, {% endif -%} `{{ fmt }} <{{ dest_dir }}/{{ fig.basename }}.{{ fmt }}>`__ {%- endfor -%} ) {%- endif -%} {{ caption }} {% endfor %} .. only:: not html {% for fig in figures %} .. raw:: html {% for option in options -%} {{ option }} {% endfor %} {{ caption }} {% endfor %} """ exception_template = """ .. only:: html [`source code <%(linkdir)s/%(basename)s.py>`__] Exception occurred rendering plot. """ # the context of the plot for all directives specified with the # :context: option plot_context = dict() class FigureFile: def __init__(self, basename, dirname): self.basename = basename self.dirname = dirname self.formats = [] def filename(self, format): return os.path.join(self.dirname, "%s.%s" % (self.basename, format)) def filenames(self): return [self.filename(fmt) for fmt in self.formats] def out_of_date(original, derived): """ Return whether *derived* is out-of-date relative to *original*, both of which are full file paths. """ return not os.path.exists(derived) or ( os.path.exists(original) and os.stat(derived).st_mtime < os.stat(original).st_mtime ) class PlotError(RuntimeError): pass def run_code(code, code_path, ns=None, function_name=None, fig_vars=None): """ Import a Python module from a path, and run the function given by name, if function_name is not None. """ # Change the working directory to the directory of the example, so # it can get at its data files, if any. Add its path to sys.path # so it can import any helper modules sitting beside it. 
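# Remember the caller's working directory; the finally clause below restores it even if the executed code raises.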
pwd = os.getcwd() if setup.config.plotly_working_directory is not None: try: os.chdir(setup.config.plotly_working_directory) except OSError as err: raise OSError( str(err) + "\n`plot_working_directory` option in" "Sphinx configuration file must be a valid " "directory path" ) from err except TypeError as err: raise TypeError( str(err) + "\n`plot_working_directory` option in " "Sphinx configuration file must be a string or " "None" ) from err elif code_path is not None: dirname = os.path.abspath(os.path.dirname(code_path)) os.chdir(dirname) try: code = unescape_doctest(code) if ns is None: ns = {} if not ns: if setup.config.plotly_pre_code is None: exec( "\n".join( [ "import numpy as np", "import plotly", "import plotly.graph_objects as go", "import plotly.express as px", ] ), ns, ) else: exec(str(setup.config.plotly_pre_code), ns) if "__main__" in code: ns["__name__"] = "__main__" variable_name = "fig" if ends_with_show(code): exec(strip_last_line(code), ns) figs = [ns[fig_var] for fig_var in fig_vars] if fig_vars else [ns[variable_name]] elif function_name is not None: exec(code, ns) exec(assign_last_line_into_variable(function_name + "()", variable_name), ns) figs = [ns[variable_name]] elif fig_vars: exec(code, ns) figs = [ns[fig_var] for fig_var in fig_vars] else: exec(assign_last_line_into_variable(code, variable_name), ns) figs = [ns[variable_name]] except (Exception, SystemExit) as err: raise PlotError(traceback.format_exc()) from err finally: os.chdir(pwd) return figs def get_plot_formats(config): default_dpi = {"html": 0} formats = [] plot_formats = config.plotly_formats for fmt in plot_formats: if isinstance(fmt, str): if ":" in fmt: suffix, dpi = fmt.split(":") formats.append((str(suffix), int(dpi))) else: formats.append((fmt, default_dpi.get(fmt, 80))) elif isinstance(fmt, (tuple, list)) and len(fmt) == 2: formats.append((str(fmt[0]), int(fmt[1]))) else: raise PlotError('invalid image format "%r" in plot_formats' % fmt) return formats def render_figures( code, code_path, output_dir, output_base, context, function_name, config, context_reset=False, close_figs=False, fig_vars=None, ): """ Run a pyplot script and save the images in *output_dir*. 
Save the images under *output_dir* with file names derived from *output_base* """ formats = get_plot_formats(config) # -- Try to determine if all images already exist code_pieces = split_code_at_show(code) # Look for single-figure output files first all_exists = True fig = FigureFile(output_base, output_dir) for format, dpi in formats: if out_of_date(code_path, fig.filename(format)): all_exists = False break fig.formats.append(format) if all_exists: return [(code, [fig])] # Then look for multi-figure output files results = [] all_exists = True for i, code_piece in enumerate(code_pieces): figures = [] for j in itertools.count(): if len(code_pieces) > 1: fig = FigureFile("%s_%02d_%02d" % (output_base, i, j), output_dir) else: fig = FigureFile("%s_%02d" % (output_base, j), output_dir) for fmt, dpi in formats: if out_of_date(code_path, fig.filename(fmt)): all_exists = False break fig.formats.append(fmt) # assume that if we have one, we have them all if not all_exists: all_exists = j > 0 break figures.append(fig) if not all_exists: break results.append((code_piece, figures)) if all_exists: return results # We didn't find the files, so build them results = [] if context: ns = plot_context else: ns = {} if context_reset: plot_context.clear() close_figs = not context or close_figs for i, code_piece in enumerate(code_pieces): if not context: pass elif close_figs: pass fig_objects = run_code(code_piece, code_path, ns, function_name, fig_vars) figures = [] for j, fig_obj in enumerate(fig_objects): if len(fig_objects) == 1 and len(code_pieces) == 1: fig = FigureFile(output_base, output_dir) elif len(code_pieces) == 1: fig = FigureFile("%s_%02d" % (output_base, j), output_dir) else: fig = FigureFile("%s_%02d_%02d" % (output_base, i, j), output_dir) figures.append(fig) for fmt, dpi in formats: try: save_plotly_figure(fig_obj, fig.filename(fmt)) except Exception as err: raise PlotError(traceback.format_exc()) from err fig.formats.append(fmt) results.append((code_piece, figures)) if not context: pass return results def run(arguments, content, options, state_machine, state, lineno): document = state_machine.document config = document.settings.env.config nofigs = "nofigs" in options formats = get_plot_formats(config) default_fmt = formats[0][0] options_copy = copy.deepcopy(options) options.setdefault("include-source", config.plotly_include_source) options.setdefault("iframe-width", config.plotly_iframe_width) options.setdefault("iframe-height", config.plotly_iframe_height) keep_context = "context" in options context_opt = None if not keep_context else options["context"] rst_file = document.attributes["source"] rst_dir = os.path.dirname(rst_file) if len(arguments): if not config.plotly_basedir: source_file_name = os.path.join(setup.app.builder.srcdir, directives.uri(arguments[0])) else: source_file_name = os.path.join( setup.confdir, config.plotly_basedir, directives.uri(arguments[0]) ) # If there is content, it will be passed as a caption. caption = "\n".join(content) # Enforce unambiguous use of captions. if "caption" in options: if caption: raise ValueError( "Caption specified in both content and options." " Please remove ambiguity." 
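# The directive content doubles as the caption for file-based plots, so the two caption sources conflict here.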
) # Use caption option caption = options["caption"] # If the optional function name is provided, use it if len(arguments) == 2: function_name = arguments[1] else: function_name = None code = Path(source_file_name).read_text(encoding="utf-8") output_base = os.path.basename(source_file_name) else: source_file_name = rst_file code = textwrap.dedent("\n".join(map(str, content))) counter = document.attributes.get("_plot_counter", 0) + 1 document.attributes["_plot_counter"] = counter base, ext = os.path.splitext(os.path.basename(source_file_name)) output_base = "%s-%d.py" % (base, counter) function_name = None caption = options.get("caption", "") base, source_ext = os.path.splitext(output_base) if source_ext in (".py", ".rst", ".txt"): output_base = base else: source_ext = "" # ensure that LaTeX includegraphics doesn't choke in foo.bar.pdf filenames output_base = output_base.replace(".", "-") # is it in doctest format? is_doctest = contains_doctest(code) if "format" in options: if options["format"] == "python": is_doctest = False else: is_doctest = True # determine output directory name fragment source_rel_name = relpath(source_file_name, setup.confdir) source_rel_dir = os.path.dirname(source_rel_name) while source_rel_dir.startswith(os.path.sep): source_rel_dir = source_rel_dir[1:] # build_dir: where to place output files (temporarily) build_dir = os.path.join( os.path.dirname(setup.app.doctreedir), "plot_directive", source_rel_dir ) # get rid of .. in paths, also changes pathsep # see note in Python docs for warning about symbolic links on Windows. # need to compare source and dest paths at end build_dir = os.path.normpath(build_dir) if not os.path.exists(build_dir): os.makedirs(build_dir) # output_dir: final location in the builder's directory dest_dir = os.path.abspath(os.path.join(setup.app.builder.outdir, source_rel_dir)) if not os.path.exists(dest_dir): os.makedirs(dest_dir) # no problem here for me, but just use built-ins # how to link to files from the RST file dest_dir_link = os.path.join(relpath(setup.confdir, rst_dir), source_rel_dir).replace( os.path.sep, "/" ) try: build_dir_link = relpath(build_dir, rst_dir).replace(os.path.sep, "/") except ValueError: # on Windows, relpath raises ValueError when path and start are on # different mounts/drives build_dir_link = build_dir source_link = dest_dir_link + "/" + output_base + source_ext # make figures try: results = render_figures( code, source_file_name, build_dir, output_base, keep_context, function_name, config, context_reset=context_opt == "reset", close_figs=context_opt == "close-figs", fig_vars=options.get("fig-vars"), ) errors = [] except PlotError as err: reporter = state.memo.reporter sm = reporter.system_message( 2, "Exception occurred in plotting {}\n from {}:\n{}".format( output_base, source_file_name, err ), line=lineno, ) results = [(code, [])] errors = [sm] # Properly indent the caption caption = "\n".join(" " + line.strip() for line in caption.split("\n")) # generate output restructuredtext total_lines = [] for j, (code_piece, figures) in enumerate(results): if options["include-source"]: if is_doctest: lines = ["", *code_piece.splitlines()] else: lines = [ ".. 
code-block:: python", "", *textwrap.indent(code_piece, " ").splitlines(), ] source_code = "\n".join(lines) else: source_code = "" if nofigs: figures = [] opts = [ ":%s: %s" % (key, val) for key, val in options.items() if key in ("alt", "height", "width", "scale", "align", "class") ] # Not-None src_link signals the need for a source link in the generated # html if j == 0 and config.plotly_html_show_source_link: src_link = source_link else: src_link = None if config.plotly_include_directive_source: directive_source = create_directive_block("plotly", arguments, options_copy, content) directive_source = create_code_block(directive_source, "text") else: directive_source = "" result = jinja2.Template(config.plotly_template or TEMPLATE).render( directive_source=directive_source, default_fmt=default_fmt, dest_dir=dest_dir_link, build_dir=build_dir_link, source_link=src_link, multi_figure=len(figures) > 1, options=opts, figures=figures, iframe_width=options["iframe-width"], iframe_height=options["iframe-height"], source_code=source_code, html_show_formats=config.plotly_html_show_formats and len(figures), caption=caption, ) total_lines.extend(result.split("\n")) total_lines.extend("\n") if total_lines: state_machine.insert_input(total_lines, source=source_file_name) # copy image files to builder's output directory, if necessary Path(dest_dir).mkdir(parents=True, exist_ok=True) for code_piece, figures in results: for fig in figures: for fn in fig.filenames(): destfig = os.path.join(dest_dir, os.path.basename(fn)) if fn != destfig: shutil.copyfile(fn, destfig) # copy script (if necessary) Path(dest_dir, output_base + source_ext).write_text( unescape_doctest(code) if source_file_name == rst_file else code, encoding="utf-8", ) return errors ================================================ FILE: _ext/symlink.py ================================================ from docutils import nodes from docutils.parsers.rst import Directive, directives import os, sys def remove_symlink_handler(app, exception): dst = './src' if os.path.exists(dst): if os.path.isdir(dst): if os.path.islink(dst): os.unlink(dst) else: shutil.rmtree(dst) else: if os.path.islink(dst): os.unlink(dst) else: os.remove(dst) def setup(app): app.connect('build-finished', remove_symlink_handler) src = '../src' dst = './src' # This creates a symbolic link on python in tmp directory if os.path.exists(dst): if os.path.isdir(dst): if os.path.islink(dst): os.unlink(dst) else: shutil.rmtree(dst) else: if os.path.islink(dst): os.unlink(dst) else: os.remove(dst) os.symlink(src, dst) return { 'version': '1.0', 'parallel_read_safe': True, 'parallel_write_safe': True, } ================================================ FILE: _static/css/custom.css ================================================ .xxtable-smaller-font-size p, strong { font-size:0.9em; } .ablog-post-title p { font-size:0.9em; } .ablog-post p { font-size:0.9em; } .sphinx-design-class-title-small { font-size:0.9em; } .sphinx-design-class-title-med { font-size:1em; } .sphinx-design-class-body-small { font-size:0.9em; } h1{font-size:2em;} h2{font-size:1.5em;} h3{font-size:1.3em;} h4{font-size:1.2em;} div.topic { font-size:0.85em; } li.toctree-l1 { font-size:0.95em; } th , tr, td { white-space: normal !important; } th { font-size:0.90em; } .ff th , tr, td{ font-size:0.90em; white-space: normal !important; } .ff div.section.p { font-size:0.8em; } hr { border-color: #0000DD; height: 2px; } ================================================ FILE: _static/css/custom.css.new 
================================================ .table-smaller-font-size p, strong { font-size: 90%; } td, th, tr { white-space: normal !important; } /* Fixes the size of the RTD flyout */ /* .rst-versions { width: 320px !important; } */ /* Content area color */ .wy-nav-content { background: #ffffff; } /* Scroll bar */ .wy-side-scroll { width: auto; overflow-y: auto; margin-top: 0px; } /* width of the side panel */ .wy-nav-side { width: 320px; } /* content section full screen */ .wy-nav-content { max-width: none; } /* set color of left side bar */ .wy-nav-side,.wy-side-nav-search,.wy-nav-top { /* background: #0079c1; alternative: #005eb8 */ background: #ffffff; } /* Change caption color to be more legible */ .wy-menu > .caption > span.caption-text { color: #000000; font-size: 20px; } /* Change the version color to match caption color */ .wy-side-nav-search>div.version { color: #000000; } /* Replace the default yellow highlight color with a neutral background */ .highlight .hll { background-color: #ffffff; } /* @media screen and (max-width: 768px) { .wy-nav-content-wrap { margin-left: 0px; } .wy-nav-side { width: 500px; } } */ ================================================ FILE: _templates/recentposts.html ================================================ {% if ablog %}
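{# Sidebar fragment: list the ten most recent posts via ablog's recent() helper #}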
{{ gettext('Recent Posts') }}
{% set pcount = 1 %}
{% for recent in ablog.recent(10, pagename) %}
{{ recent.title }}
{% endfor %}
{% endif %} ================================================ FILE: _templates/search-field.html ================================================
Search Engine: Default Google
================================================ FILE: _templates/search-google.html ================================================ {%- extends "page.html" %} {# Override the body with the custom search structure we want #} {% block docs_body %}
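{# Body override for the Google-powered search page #}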

{{ _("Search") }}

{{ _('Search Results') }}

{% endblock docs_body %} {# Below sections just re-create the behavior of Sphinx default search #} {# Page metadata #} {%- block htmltitle -%} {{ _("Search") }} - {{ title or docstitle }} {%- endblock htmltitle -%} {# Manually include the search JS that Sphinx includes #} {% block scripts -%} {{ super() }} {%- endblock scripts %} ================================================ FILE: _templates/search.html ================================================ {%- extends "page.html" %} {# Override the body with the custom search structure we want #} {% block docs_body %}

{{ _("Search") }}

{% endblock docs_body %} {# Below sections just re-create the behavior of Sphinx default search #} {# Page metadata #} {%- block htmltitle -%} {{ _("Search") }} - {{ title or docstitle }} {%- endblock htmltitle -%} {# Manually include the search JS that Sphinx includes #} {% block scripts -%} {{ super() }} {%- endblock scripts %} ================================================ FILE: _utilities/JIRA_SETUP_QUICKSTART.md ================================================ # Jira Integration Quick Start ## Prerequisites Check Run these commands to verify you have everything installed: ```bash # Check AWS CLI aws --version # Check ada credentials tool ada --version # Check Python 3 python3 --version # Check if uvx is available (for MCP server) uvx --version ``` If any are missing, install them: ```bash # AWS CLI brew install awscli # ada credentials tool toolbox install ada # uv (includes uvx) brew install uv ``` ## One-Time Setup ### 1. Configure Ada Credentials ```bash ada credentials setup ``` When prompted: - **Account**: 621547421844 - **Role**: Admin - **Profile name**: kaena ### 2. Add Kaena Profile to AWS Config ```bash echo '[profile kaena] credential_process='$HOME'/.toolbox/bin/ada credentials print --profile=kaena' >> ~/.aws/config ``` ### 3. Run the Setup Script ```bash cd /path/to/aws-neuron-sdk-staging chmod +x _utilities/setup_jira_token.sh ./_utilities/setup_jira_token.sh ``` This script will: - Fetch the Jira API token from AWS Secrets Manager - Update your MCP configuration with the token - Verify everything is set up correctly ### 4. Restart Kiro After running the setup script, restart Kiro CLI to load the new MCP server. ## Using Jira in Kiro Once set up, you can use Kiro Powers to interact with Jira: ```bash # In Kiro CLI, check available powers kiro powers list # Look for Atlassian/Jira related tools ``` ## Manual Verification To manually verify the setup worked: ```bash # Check MCP config has Jira server cat ~/.kiro/settings/mcp.json | grep -A 10 atlassian-jira # Test AWS Secrets Manager access export AWS_PROFILE=kaena aws secretsmanager get-secret-value \ --secret-id NKI_JIRA_API_TOKEN \ --region us-west-2 \ --query SecretString \ --output text ``` ## Troubleshooting ### "Error: Failed to fetch Jira API token" 1. Verify ada credentials are set up: ```bash ada credentials list ``` 2. Check AWS profile is configured: ```bash cat ~/.aws/config | grep -A 2 kaena ``` 3. Test AWS access: ```bash export AWS_PROFILE=kaena aws sts get-caller-identity ``` ### "MCP server not loading" 1. Check uvx is installed: ```bash uvx --version ``` 2. Manually test the MCP server: ```bash uvx mcp-server-atlassian ``` 3. Check Kiro MCP logs (location varies by installation) ## What's Next After setup, you can: - Query NKI Jira tickets - Create new tickets - Update ticket status - Search and filter tickets - Generate reports See the full guide at `.kiro/steering/jira.md` for detailed usage examples. ================================================ FILE: _utilities/add_meta.py ================================================ #!/usr/bin/env python3 """Add missing .. 
meta:: blocks with :description:, :keywords:, and :date-modified: to .rst files.""" import os import re import sys from pathlib import Path TODAY = "2026-03-13" # Map file paths to sensible descriptions/keywords based on content def infer_meta(filepath: str, content: str) -> dict: """Infer description and keywords from file path and content.""" rel = filepath.replace("frameworks/", "") # Extract title from RST title = "" lines = content.split("\n") title_chars = set("=-~^\"'`#*+_.") for i, line in enumerate(lines): stripped = line.rstrip() if (len(stripped) >= 3 and len(set(stripped)) == 1 and stripped[0] in title_chars and i > 0): candidate = lines[i-1].strip() if candidate and not candidate.startswith(".."): title = candidate break # Build description from title or path if title: desc = f"{title} - AWS Neuron SDK documentation" else: desc = f"AWS Neuron SDK documentation for {os.path.basename(filepath).replace('.rst', '').replace('-', ' ')}" # Build keywords from path components kw_parts = set() if "torch" in rel: kw_parts.update(["PyTorch", "AWS Neuron"]) if "neuronx" in rel: kw_parts.update(["torch-neuronx", "Trainium", "Inferentia"]) if "jax" in rel: kw_parts.update(["JAX", "AWS Neuron", "JAX NeuronX"]) if "training" in rel.lower(): kw_parts.add("training") if "inference" in rel.lower(): kw_parts.add("inference") if "setup" in rel.lower() or "install" in rel.lower() or "update" in rel.lower(): kw_parts.add("setup") if "tutorial" in rel.lower(): kw_parts.add("tutorials") if "api" in rel.lower(): kw_parts.add("API reference") if "profil" in rel.lower(): kw_parts.add("profiling") if "troubleshoot" in rel.lower(): kw_parts.add("troubleshooting") if "debug" in rel.lower(): kw_parts.add("debugging") if not kw_parts: kw_parts.update(["AWS Neuron", "machine learning"]) keywords = ", ".join(sorted(kw_parts)) return {"description": desc, "keywords": keywords} def has_meta_field(content: str, field: str) -> bool: """Check if a .. meta:: block contains a specific field.""" return bool(re.search(rf"^\s+:{field}:", content, re.MULTILINE)) def process_file(filepath: str, dry_run: bool = False): """Process a single .rst file to ensure it has complete meta block.""" with open(filepath, "r", encoding="utf-8", errors="replace") as f: content = f.read() # Skip include-only fragments (no title, very short) if len(content.strip()) < 50: print(f" SKIP (fragment): {filepath}") return False has_meta = ".. meta::" in content has_desc = has_meta_field(content, "description") has_kw = has_meta_field(content, "keywords") has_date = has_meta_field(content, "date-modified") if has_desc and has_kw and has_date: print(f" OK (complete): {filepath}") return False meta = infer_meta(filepath, content) if has_meta: # Meta block exists but missing fields — add them missing = [] if not has_desc: missing.append(f" :description: {meta['description']}") if not has_kw: missing.append(f" :keywords: {meta['keywords']}") if not has_date: missing.append(f" :date-modified: {TODAY}") insert_text = "\n".join(missing) # Find the end of the existing meta block (last line starting with :field:) lines = content.split("\n") meta_start = -1 meta_last_field = -1 for i, line in enumerate(lines): if line.strip() == ".. 
meta::": meta_start = i elif meta_start >= 0 and re.match(r"\s+:\w", line): meta_last_field = i elif meta_start >= 0 and meta_last_field >= 0 and not line.strip().startswith(":") and not (line.strip() and not line[0].isspace()): break if meta_last_field >= 0: lines.insert(meta_last_field + 1, insert_text) new_content = "\n".join(lines) else: # Fallback: insert after .. meta:: line new_content = content.replace(".. meta::", f".. meta::\n{insert_text}", 1) else: # No meta block at all — add one at the top (after any labels) lines = content.split("\n") insert_idx = 0 # Skip leading labels (.. _label:) and blank lines for i, line in enumerate(lines): stripped = line.strip() if stripped.startswith(".. _") and stripped.endswith(":"): insert_idx = i + 1 elif stripped == "" and i <= insert_idx + 1: insert_idx = i + 1 else: break meta_block = ( f"\n.. meta::\n" f" :description: {meta['description']}\n" f" :keywords: {meta['keywords']}\n" f" :date-modified: {TODAY}\n\n" ) lines.insert(insert_idx, meta_block) new_content = "\n".join(lines) action = "UPDATE" if has_meta else "ADD" fields = [] if not has_desc: fields.append("description") if not has_kw: fields.append("keywords") if not has_date: fields.append("date-modified") print(f" {action} ({', '.join(fields)}): {filepath}") if not dry_run: with open(filepath, "w", encoding="utf-8") as f: f.write(new_content) return True def main(): import argparse parser = argparse.ArgumentParser(description="Add meta blocks to .rst files") parser.add_argument("directory", default="frameworks", nargs="?") parser.add_argument("--dry-run", action="store_true", help="Show what would change without writing") args = parser.parse_args() root = Path(args.directory) rst_files = sorted(root.rglob("*.rst")) print(f"Scanning {len(rst_files)} .rst files in {root}/:") changed = 0 for f in rst_files: if process_file(str(f), dry_run=args.dry_run): changed += 1 print(f"\n{'Would change' if args.dry_run else 'Changed'} {changed} file(s).") if __name__ == "__main__": main() ================================================ FILE: _utilities/audit_frameworks.py ================================================ #!/usr/bin/env python3 """ Audit script for the /frameworks directory of the AWS Neuron SDK documentation. Detects orphaned pages (not referenced by any toctree, :doc:, :ref:, or .. include:: directive) and stale pages (containing outdated references). Usage: python3 _utilities/audit_frameworks.py --root . 
--output audit-report.md
"""

import argparse
import os
import re
from pathlib import Path

# ---------------------------------------------------------------------------
# Reference extraction helpers
# ---------------------------------------------------------------------------

# Regex patterns for RST directives and roles
TOCTREE_BLOCK_RE = re.compile(r"^\.\.\s+toctree::", re.MULTILINE)
DOC_ROLE_RE = re.compile(r":doc:`(?:[^<`]*<)?(/[^>`]+|[^>`/][^>`]*)`")
REF_ROLE_RE = re.compile(r":ref:`(?:[^<`]*<)?([^>`]+)`")
INCLUDE_RE = re.compile(r"^\.\.\s+include::\s+(.+)$", re.MULTILINE)
LABEL_RE = re.compile(r"^\.\.\s+_([a-zA-Z0-9_-]+)\s*:", re.MULTILINE)


def _resolve_path(ref: str, referencing_file: Path, root: Path) -> str | None:
    """Resolve a toctree/doc/include reference to a repo-relative path."""
    ref = ref.strip()
    if not ref:
        return None
    # Absolute path (starts with /)
    if ref.startswith("/"):
        resolved = ref.lstrip("/")
    else:
        # Relative to the directory of the referencing file
        ref_dir = referencing_file.parent.relative_to(root)
        resolved = str(ref_dir / ref)
    # Normalise (collapse ..)
    resolved = os.path.normpath(resolved)
    return resolved


def _resolve_to_files(base: str, root: Path) -> list[str]:
    """Given a resolved base path, return candidate file paths that exist."""
    candidates = []
    # Direct file match (already has extension)
    if (root / base).is_file():
        candidates.append(base)
        return candidates
    # Try common extensions
    for ext in (".rst", ".ipynb", ".txt"):
        p = base + ext
        if (root / p).is_file():
            candidates.append(p)
    # Could be a directory with index.rst
    idx = os.path.join(base, "index.rst")
    if (root / idx).is_file():
        candidates.append(idx)
    return candidates


def extract_toctree_entries(content: str, filepath: Path, root: Path) -> set[str]:
    """Extract all file paths referenced in toctree directives."""
    referenced: set[str] = set()
    lines = content.split("\n")
    i = 0
    while i < len(lines):
        if TOCTREE_BLOCK_RE.match(lines[i]):
            # Skip toctree options (lines starting with : or blank within indent)
            i += 1
            # Skip blank lines and option lines
            while i < len(lines):
                stripped = lines[i].strip()
                if stripped == "" or stripped.startswith(":"):
                    i += 1
                    continue
                break
            # Now read toctree entries (indented non-empty lines)
            while i < len(lines):
                line = lines[i]
                stripped = line.strip()
                if stripped == "":
                    i += 1
                    continue
                # Check if still indented (part of toctree body)
                if line[0] in (" ", "\t"):
                    # Entry may have a title: "Title <path>" or just "path"
                    entry = stripped
                    m = re.match(r".*<(.+)>", entry)
                    if m:
                        entry = m.group(1).strip()
                    # Resolve the path
                    resolved = _resolve_path(entry, filepath, root)
                    if resolved:
                        for f in _resolve_to_files(resolved, root):
                            referenced.add(f)
                    i += 1
                else:
                    break
        else:
            i += 1
    return referenced


def extract_doc_refs(content: str, filepath: Path, root: Path) -> set[str]:
    """Extract all file paths referenced via :doc: roles."""
    referenced: set[str] = set()
    for m in DOC_ROLE_RE.finditer(content):
        ref = m.group(1).strip()
        resolved = _resolve_path(ref, filepath, root)
        if resolved:
            for f in _resolve_to_files(resolved, root):
                referenced.add(f)
    return referenced
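
# Example (illustrative): resolving a relative :doc: reference. Given the file
# frameworks/torch/index.rst containing :doc:`setup/install`, _resolve_path()
# returns "frameworks/torch/setup/install", and _resolve_to_files() then probes
# the repo for setup/install.rst, .ipynb, .txt, or setup/install/index.rst.
#
#   _resolve_path("setup/install", Path("frameworks/torch/index.rst"), Path("."))
#   -> "frameworks/torch/setup/install"


def extract_include_refs(content: str, filepath: Path, root: Path) -> set[str]:
    """Extract all file paths referenced via .. 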
include:: directives.""" referenced: set[str] = set() for m in INCLUDE_RE.finditer(content): ref = m.group(1).strip() resolved = _resolve_path(ref, filepath, root) if resolved: for f in _resolve_to_files(resolved, root): referenced.add(f) return referenced def extract_ref_labels(content: str) -> set[str]: """Extract all :ref: label targets from content.""" return set(m.group(1) for m in REF_ROLE_RE.finditer(content)) def extract_label_definitions(content: str) -> set[str]: """Extract all label definitions (.. _label:) from content.""" return set(m.group(1) for m in LABEL_RE.finditer(content)) # --------------------------------------------------------------------------- # Orphan detection # --------------------------------------------------------------------------- def find_all_framework_files(root: Path) -> tuple[set[str], set[str], set[str]]: """Find all .rst, .ipynb, and .txt files under frameworks/. Returns (rst_files, ipynb_files, txt_files) as repo-relative paths. """ rst_files: set[str] = set() ipynb_files: set[str] = set() txt_files: set[str] = set() fw_dir = root / "frameworks" if not fw_dir.is_dir(): return rst_files, ipynb_files, txt_files for p in fw_dir.rglob("*"): if not p.is_file(): continue rel = str(p.relative_to(root)) if "__pycache__" in rel: continue if p.suffix == ".rst": rst_files.add(rel) elif p.suffix == ".ipynb": ipynb_files.add(rel) elif p.suffix == ".txt": txt_files.add(rel) return rst_files, ipynb_files, txt_files def collect_all_references(root: Path) -> tuple[set[str], set[str], set[str]]: """Scan ALL .rst and .txt files in the repo to collect references. Returns (toctree_and_doc_refs, include_refs, ref_labels_used). We scan the entire repo (not just /frameworks) so that references from root index.rst, setup/, about-neuron/, etc. are captured. """ toctree_doc_refs: set[str] = set() include_refs: set[str] = set() ref_labels_used: set[str] = set() # Directories to skip entirely skip_dirs = {"_build", ".git", "venv", ".venv", "__pycache__", ".kiro", ".vscode", ".github", "node_modules", "_backup-rn"} for ext in ("*.rst", "*.txt"): for p in root.rglob(ext): # Skip files in excluded directories rel = str(p.relative_to(root)) parts = Path(rel).parts if any(part in skip_dirs for part in parts): continue try: content = p.read_text(encoding="utf-8", errors="replace") except Exception: continue toctree_doc_refs |= extract_toctree_entries(content, p, root) toctree_doc_refs |= extract_doc_refs(content, p, root) include_refs |= extract_include_refs(content, p, root) ref_labels_used |= extract_ref_labels(content) return toctree_doc_refs, include_refs, ref_labels_used def build_label_to_file_map(root: Path) -> dict[str, str]: """Build a mapping from :ref: label -> repo-relative file path. Only scans files under frameworks/ since we only need to know which framework files are referenced via :ref:. """ label_map: dict[str, str] = {} fw_dir = root / "frameworks" if not fw_dir.is_dir(): return label_map for p in fw_dir.rglob("*.rst"): rel = str(p.relative_to(root)) try: content = p.read_text(encoding="utf-8", errors="replace") except Exception: continue for label in extract_label_definitions(content): label_map[label] = rel return label_map def detect_orphans(root: Path) -> list[dict]: """Detect orphaned pages under /frameworks. Returns a list of dicts with keys: path, type, reason, action. 
""" rst_files, ipynb_files, txt_files = find_all_framework_files(root) toctree_doc_refs, include_refs, ref_labels_used = collect_all_references(root) label_map = build_label_to_file_map(root) # Files referenced via :ref: labels ref_referenced_files: set[str] = set() for label in ref_labels_used: if label in label_map: ref_referenced_files.add(label_map[label]) # All referenced content files (rst + ipynb) all_content_refs = toctree_doc_refs | ref_referenced_files # All referenced include files (txt) all_include_refs = include_refs orphans: list[dict] = [] # Check .rst and .ipynb files against toctree/doc/ref references for f in sorted(rst_files | ipynb_files): if f not in all_content_refs and f not in all_include_refs: ext = Path(f).suffix orphans.append({ "path": f, "type": ext, "reason": "Not in any toctree or cross-reference", "action": "Delete", }) # Check .txt files against include references only for f in sorted(txt_files): if f not in all_include_refs: orphans.append({ "path": f, "type": ".txt (include fragment)", "reason": "Not referenced by any .. include:: directive", "action": "Delete", }) return orphans # --------------------------------------------------------------------------- # Stale page detection # --------------------------------------------------------------------------- # Staleness indicator patterns STALE_OS_RE = re.compile( r"Ubuntu\s+18\.04|Ubuntu\s+20\.04|Amazon\s+Linux\s+2(?!\s*023)(?!\s*\d{3})\b", re.IGNORECASE, ) STALE_PYTHON_RE = re.compile( r"Python\s+3\.[0-9](?!\d)\b", # matches Python 3.0 through 3.9 ) STALE_SDK_RE = re.compile(r"Neuron\s+SDK\s+2\.(\d+)") TORCH_NEURON_SETUP_RE = re.compile( r"torch-neuron.*(?:setup|install|update)", re.IGNORECASE, ) NEURON_CC_RE = re.compile(r"\bneuron-cc\b") def _check_stale_python(content: str) -> list[str]: """Find references to Python versions below 3.10.""" indicators = [] for m in STALE_PYTHON_RE.finditer(content): ver_str = m.group(0) # Extract minor version minor = int(ver_str.split(".")[-1]) if minor < 10: indicators.append(ver_str) return list(set(indicators)) def _check_stale_sdk(content: str) -> list[str]: """Find references to Neuron SDK versions older than 2.20.""" indicators = [] for m in STALE_SDK_RE.finditer(content): ver = int(m.group(1)) if ver < 20: indicators.append(m.group(0)) return list(set(indicators)) def _check_stale_os(content: str) -> list[str]: """Find references to unsupported OS versions.""" return list(set(m.group(0) for m in STALE_OS_RE.finditer(content))) def _check_torch_neuron_unsupported_os(content: str) -> list[str]: """Flag torch-neuron setup/update instructions for unsupported OS.""" indicators = [] if TORCH_NEURON_SETUP_RE.search(content): os_refs = _check_stale_os(content) if os_refs: indicators.append( f"torch-neuron setup/update with unsupported OS: {', '.join(os_refs)}" ) return indicators def _check_neuron_cc(content: str) -> list[str]: """Flag deprecated neuron-cc references.""" if NEURON_CC_RE.search(content): return ["References deprecated neuron-cc compiler"] return [] def detect_stale_pages(root: Path) -> list[dict]: """Detect stale pages under /frameworks. Returns a list of dicts with keys: path, indicators, recommendation. 
""" stale: list[dict] = [] fw_dir = root / "frameworks" if not fw_dir.is_dir(): return stale for p in fw_dir.rglob("*"): if not p.is_file(): continue if p.suffix not in (".rst", ".txt"): continue rel = str(p.relative_to(root)) try: content = p.read_text(encoding="utf-8", errors="replace") except Exception: continue indicators: list[str] = [] indicators.extend(_check_stale_os(content)) indicators.extend(_check_stale_python(content)) indicators.extend(_check_stale_sdk(content)) indicators.extend(_check_torch_neuron_unsupported_os(content)) indicators.extend(_check_neuron_cc(content)) if indicators: # Determine recommendation is_archival = ( "mxnet-neuron/" in rel or "tensorflow/" in rel or ("torch-neuron/" in rel and "torch-neuronx/" not in rel) ) if is_archival: rec = "Will be archived" else: rec = "Update or archive" stale.append({ "path": rel, "indicators": "; ".join(sorted(set(indicators))), "recommendation": rec, }) return sorted(stale, key=lambda x: x["path"]) # --------------------------------------------------------------------------- # Report generation # --------------------------------------------------------------------------- def generate_report(orphans: list[dict], stale: list[dict]) -> str: """Generate the audit report as Markdown.""" lines: list[str] = [] lines.append("# Frameworks Audit Report\n") # Orphaned pages lines.append("## Orphaned Pages\n") if orphans: lines.append("| File Path | Type | Reason | Action |") lines.append("|---|---|---|---|") for o in orphans: lines.append( f"| {o['path']} | {o['type']} | {o['reason']} | {o['action']} |" ) else: lines.append("No orphaned pages detected.\n") lines.append("") # Stale pages lines.append("## Stale Pages\n") if stale: lines.append("| File Path | Staleness Indicators | Recommendation |") lines.append("|---|---|---|") for s in stale: lines.append( f"| {s['path']} | {s['indicators']} | {s['recommendation']} |" ) else: lines.append("No stale pages detected.\n") lines.append("") return "\n".join(lines) # --------------------------------------------------------------------------- # CLI # --------------------------------------------------------------------------- def main(): parser = argparse.ArgumentParser( description="Audit /frameworks for orphaned and stale pages." 
    )
    parser.add_argument(
        "--root",
        default=".",
        help="Repository root directory (default: current directory)",
    )
    parser.add_argument(
        "--output",
        default="audit-report.md",
        help="Output file path for the audit report (default: audit-report.md)",
    )
    args = parser.parse_args()

    root = Path(args.root).resolve()
    print(f"Auditing frameworks under: {root}")

    orphans = detect_orphans(root)
    print(f"Found {len(orphans)} orphaned page(s).")

    stale = detect_stale_pages(root)
    print(f"Found {len(stale)} stale page(s).")

    report = generate_report(orphans, stale)
    output_path = Path(args.output)
    if not output_path.is_absolute():
        output_path = root / output_path
    output_path.write_text(report, encoding="utf-8")
    print(f"Audit report written to: {output_path}")


if __name__ == "__main__":
    main()

================================================
FILE: _utilities/check_urls.sh
================================================
#!/bin/bash

# Output file
output_file="url_check_results.txt"

# Initialize counters
total=0
working=0
not_found=0
other=0

# Create output file with header
echo "URL Status Check Results" > "$output_file"
echo "=========================" >> "$output_file"
echo "" >> "$output_file"

# Read each URL from the file (-r keeps any backslashes in URLs intact)
while read -r url; do
    # Skip empty lines
    if [ -z "$url" ]; then
        continue
    fi

    # Increment total counter
    ((total++))

    # Print progress
    echo "Checking $total: $url"

    # Use curl to check the URL status
    status_code=$(curl -s -o /dev/null -w "%{http_code}" "$url")

    # Check status code
    if [ "$status_code" -eq 200 ]; then
        echo "✓ WORKING: $url" >> "$output_file"
        ((working++))
    elif [ "$status_code" -eq 404 ]; then
        echo "✗ NOT FOUND (404): $url" >> "$output_file"
        ((not_found++))
    else
        echo "? OTHER STATUS ($status_code): $url" >> "$output_file"
        ((other++))
    fi

    # Small delay to avoid overwhelming the server
    sleep 0.1
done < old-nki-apis.txt

# Write summary
echo "" >> "$output_file"
echo "" >> "$output_file"
echo "Summary" >> "$output_file"
echo "=======" >> "$output_file"
echo "Total URLs checked: $total" >> "$output_file"
echo "Working URLs: $working" >> "$output_file"
echo "Not found (404) URLs: $not_found" >> "$output_file"
echo "Other status URLs: $other" >> "$output_file"

echo "URL check completed. Results saved to $output_file"
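
# Example (illustrative): run the check against the checked-in list of old
# NKI API URLs, then inspect the summary block at the end of the results.
# Both this script and old-nki-apis.txt live in _utilities/, so run from there:
#
#   cd _utilities
#   bash check_urls.sh
#   tail -n 8 url_check_results.txt

================================================
FILE: _utilities/create_sitemap.py
================================================
# v1.0 by dougeric 2025-09-30
# Script to create sitemap.xml for Sphinx-generated docs; must be run at the
# root of the docs repo with the virtual environment activated

import os
from pathlib import Path
from datetime import datetime


def create_sitemap(root_dir, base_url):
    """
    This function generates a sitemap.xml file for the given root directory and base URL.
    It recursively scans all .rst files in the root directory, excluding those in
    directories starting with "_". For each .rst file, it calculates the last
    modification time, converts the .rst path to the corresponding HTML path, and
    adds an entry to the sitemap in the format required by Google Search Console.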
""" sitemap = ['', ''] for path in Path(root_dir).rglob('*.rst'): # Skip directories starting with "_" if any(part.startswith('_') for part in path.parts): continue # Convert .rst path to expected html path rel_path = path.relative_to(root_dir) html_path = str(rel_path).replace('.rst', '.html') # Get file modification time mod_time = datetime.fromtimestamp(os.path.getmtime(path)) sitemap.append(f' ') sitemap.append(f' {base_url}/{html_path}') sitemap.append(f' {mod_time.strftime("%Y-%m-%d")}') sitemap.append(f' ') sitemap.append('') return '\n'.join(sitemap) # Call the function and write the result to sitemap.xml sitemap_content = create_sitemap('./', 'https://awsdocs-neuron.readthedocs-hosted.com/en/latest') with open('sitemap.xml', 'w') as f: f.write(sitemap_content) print("\nsitemap.xml has been created.\n") ================================================ FILE: _utilities/format_build_logs.py ================================================ #!/usr/bin/env python3 """ Format Sphinx Build Logs This script checks for Python 3.9 and pip, creates a virtual environment, runs sphinx-build, and formats the build log as Markdown with separate sections for errors and warnings. """ import os import sys import subprocess import re import datetime import platform import shutil from collections import Counter from pathlib import Path def check_python_version(): """Check if Python 3.9 is installed.""" python_version = sys.version_info if python_version.major != 3 or python_version.minor != 9: print("Error: Python 3.9 is required.") if platform.system() == "Darwin": # macOS print("To install Python 3.9 on macOS, visit: https://www.python.org/downloads/release/python-3913/") print("Or use Homebrew: brew install python@3.9") elif platform.system() == "Windows": print("To install Python 3.9 on Windows, visit: https://www.python.org/downloads/release/python-3913/") else: print("Please install Python 3.9 from: https://www.python.org/downloads/release/python-3913/") sys.exit(1) return True def check_pip_installed(): """Check if pip is installed.""" try: subprocess.run([sys.executable, "-m", "pip", "--version"], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) return True except subprocess.CalledProcessError: print("Error: pip is not installed.") print("Please install pip: https://pip.pypa.io/en/stable/installation/") sys.exit(1) def find_repo_root(): """Find the root of the private-aws-neuron-sdk-staging repo.""" # Start with the current directory current_dir = Path.cwd() # Check if we're already in the repo root if current_dir.name == "private-aws-neuron-sdk-staging": return current_dir # Check parent directory parent_dir = current_dir.parent if parent_dir.name == "private-aws-neuron-sdk-staging": return parent_dir # Look for the repo in the current directory for item in current_dir.iterdir(): if item.is_dir() and item.name == "private-aws-neuron-sdk-staging": return item # Look for the repo in the parent directory for item in parent_dir.iterdir(): if item.is_dir() and item.name == "private-aws-neuron-sdk-staging": return item print("Error: Repository 'private-aws-neuron-sdk-staging' not found on local machine.") sys.exit(1) def setup_venv(repo_parent): """Create and activate a Python 3.9 virtual environment.""" venv_path = repo_parent / "venv" # Create venv if it doesn't exist if not venv_path.exists(): print(f"Creating virtual environment at {venv_path}...") try: subprocess.run([sys.executable, "-m", "venv", str(venv_path)], check=True) except subprocess.CalledProcessError as e: print(f"Error 
creating virtual environment: {e}") sys.exit(1) # Determine the path to the activate script if platform.system() == "Windows": activate_script = venv_path / "Scripts" / "activate.bat" activate_cmd = str(activate_script) else: activate_script = venv_path / "bin" / "activate" activate_cmd = f"source {activate_script}" print(f"Virtual environment created at {venv_path}") print(f"To activate manually, run: {activate_cmd}") return venv_path def get_venv_python(venv_path): """Get the path to the Python executable in the virtual environment.""" if platform.system() == "Windows": return venv_path / "Scripts" / "python.exe" else: return venv_path / "bin" / "python" def get_venv_pip(venv_path): """Get the path to the pip executable in the virtual environment.""" if platform.system() == "Windows": return venv_path / "Scripts" / "pip.exe" else: return venv_path / "bin" / "pip" def install_requirements(repo_root, venv_pip): """Install requirements from requirements.txt.""" requirements_file = repo_root / "requirements.txt" if not requirements_file.exists(): print(f"Error: requirements.txt not found at {requirements_file}") sys.exit(1) print("Installing requirements...") try: subprocess.run([ str(venv_pip), "install", "-r", str(requirements_file), "--extra-index-url=https://pypi.org/simple" ], check=True) except subprocess.CalledProcessError as e: print(f"Error installing requirements: {e}") sys.exit(1) print("Requirements installed successfully.") def run_sphinx_build(repo_root, venv_path): """Run sphinx-build and capture the output.""" sphinx_build_path = venv_path / "bin" / "sphinx-build" if platform.system() == "Windows": sphinx_build_path = venv_path / "Scripts" / "sphinx-build.exe" if not sphinx_build_path.exists(): print(f"Error: sphinx-build not found at {sphinx_build_path}") sys.exit(1) print("Running sphinx-build...") # Create a log file to capture output log_file_path = repo_root / "sphinx_build_output.log" try: # Run sphinx-build with output redirected to both terminal and log file with open(log_file_path, 'w') as log_file: process = subprocess.Popen( [str(sphinx_build_path), "-b", "html", ".", "_build/html", "-w", "warnings.txt"], cwd=str(repo_root), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1 ) # Capture output in real-time output = [] for line in process.stdout: print(line, end='') # Print to terminal log_file.write(line) # Write to log file output.append(line) process.wait() if process.returncode != 0: print(f"sphinx-build exited with code {process.returncode}") # Also read the warnings.txt file if it exists warnings_file = repo_root / "warnings.txt" if warnings_file.exists(): with open(warnings_file, 'r') as f: warnings_content = f.read() output.append("\n--- WARNINGS FILE CONTENT ---\n") output.append(warnings_content) return ''.join(output) except Exception as e: print(f"Error running sphinx-build: {e}") sys.exit(1) def parse_build_log(log_text): """Parse the build log to extract errors and warnings.""" # Save raw log for debugging with open("raw_build_log.txt", "w") as f: f.write(log_text) # Check if warnings.txt exists and use it directly warnings_file = Path("warnings.txt") if warnings_file.exists(): print(f"Found warnings.txt file with direct warnings from Sphinx") with open(warnings_file, 'r') as f: warnings_content = f.read() # Parse warnings.txt which has format: path:line: WARNING: message warnings = [] for line in warnings_content.split('\n'): if not line.strip(): continue # Try to match the standard format first match = re.match(r'(.*?):(\d+): 
WARNING: (.*)', line) if match: file_path, line_num, message = match.groups() warnings.append({ 'file': file_path, 'line': line_num, 'message': message.strip() }) print(f"Standard format match: file={file_path}, line={line_num}, message={message[:50]}...") else: # Check for the "document isn't included in any toctree" pattern # Format: /path/to/file.rst: WARNING: document isn't included in any toctree toctree_match = re.match(r'(.*?): WARNING: (document isn\'t included in any toctree.*)', line) if toctree_match: file_path, message = toctree_match.groups() warnings.append({ 'file': file_path, 'line': '0', # No line number in this format 'message': message.strip() }) print(f"Toctree match: file={file_path}, message={message[:50]}...") else: # If no match, just add as unknown warnings.append({ 'file': 'unknown', 'line': '0', 'message': line.strip() }) print(f"No match: message={line[:50]}...") else: print("No warnings.txt file found, parsing log output directly") warnings = [] lines = log_text.split('\n') i = 0 while i < len(lines): line = lines[i].strip() # Skip empty lines if not line: i += 1 continue # Check for the "document isn't included in any toctree" pattern # Format: /path/to/file.rst: WARNING: document isn't included in any toctree toctree_match = re.match(r'(.*?): WARNING: (document isn\'t included in any toctree.*)', line) if toctree_match: file_path, message = toctree_match.groups() warnings.append({ 'file': file_path, 'line': '0', # No line number in this format 'message': message.strip() }) i += 1 continue # Check for warnings in the raw message # This is for warnings that are already in the log as complete messages raw_warning_match = re.match(r'(.*?): WARNING: (.*)', line) if raw_warning_match: file_path, message = raw_warning_match.groups() warnings.append({ 'file': file_path, 'line': '0', # No line number in this format 'message': message.strip() }) i += 1 continue # Check for standard format: path:line: WARNING: message std_match = re.match(r'(.*?):(\d+): WARNING: (.*)', line) if std_match: file_path, line_num, message = std_match.groups() warnings.append({ 'file': file_path, 'line': line_num, 'message': message.strip() }) i += 1 continue # Check for alternative format: WARNING: message (path:line) alt_match = re.match(r'WARNING: (.*?) 
\((.*?):(\d+)\)', line) if alt_match: message, file_path, line_num = alt_match.groups() warnings.append({ 'file': file_path, 'line': line_num, 'message': message.strip() }) i += 1 continue # Check for simple warnings that start with "WARNING:" if line.startswith("WARNING:"): message = line[8:].strip() # Remove "WARNING: " prefix # Collect continuation lines i += 1 while i < len(lines) and lines[i].strip() and not lines[i].strip().startswith(("WARNING:", "ERROR:")): message += " " + lines[i].strip() i += 1 warnings.append({ 'file': 'unknown', 'line': '0', 'message': message }) continue i += 1 # Debug: Print the first few warnings to see what's being parsed print(f"Parsed {len(warnings)} warnings") for i, warning in enumerate(warnings[:5]): print(f"Warning {i+1}: file={warning['file']}, line={warning['line']}, message={warning['message'][:50]}...") # Debug: Print the warning categories categories = categorize_issues(warnings) print(f"Warning categories: {categories}") # Regular expressions for errors error_pattern = re.compile(r'(.*?):(\d+): (?:ERROR|SEVERE): (.*?)(?:\n|$)') errors = [] lines = log_text.split('\n') for line in lines: error_match = error_pattern.search(line) if error_match: file_path, line_num, message = error_match.groups() errors.append({ 'file': file_path, 'line': line_num, 'message': message.strip() }) return errors, warnings def categorize_issues(issues): """Categorize issues by type.""" categories = Counter() for issue in issues: # Extract the main category from the message message = issue['message'].lower() if "undefined label" in message: categories["Undefined Label"] += 1 elif "unknown document" in message: categories["Unknown Document"] += 1 elif "duplicate label" in message: categories["Duplicate Label"] += 1 elif "image file not found" in message: categories["Missing Image"] += 1 elif "toctree contains reference to nonexisting document" in message: categories["Missing Document"] += 1 elif "document isn't included in any toctree" in message: categories["Document Not in TOC"] += 1 else: categories["Other"] += 1 return categories def format_markdown(errors, warnings, build_time): """Format the build log as Markdown.""" timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") error_categories = categorize_issues(errors) warning_categories = categorize_issues(warnings) markdown = f"# Sphinx Build Log - {timestamp}\n\n" # Build summary markdown += "## Build Summary\n\n" markdown += f"- **Build Time**: {build_time:.2f} seconds\n" markdown += f"- **Total Errors**: {len(errors)}\n" markdown += f"- **Total Warnings**: {len(warnings)}\n\n" # Error categories if error_categories: markdown += "### Error Categories\n\n" for category, count in error_categories.most_common(): markdown += f"- **{category}**: {count}\n" markdown += "\n" # Warning categories if warning_categories: markdown += "### Warning Categories\n\n" for category, count in warning_categories.most_common(): markdown += f"- **{category}**: {count}\n" markdown += "\n" # Errors section markdown += "## Errors\n\n" if errors: for i, error in enumerate(errors, 1): # Format the file path to be more readable file_path = error['file'] if file_path.startswith('/Users/dougeric/git/private-aws-neuron-sdk-staging/'): file_path = file_path[len('/Users/dougeric/git/private-aws-neuron-sdk-staging/'):] # Create a more readable header with file and line info if error['file'] != 'unknown': markdown += f"### Error {i}: {file_path} (line {error['line']})\n\n" else: markdown += f"### Error {i}\n\n" markdown += 
f"```\n{error['message']}\n```\n\n" else: markdown += "No errors found.\n\n" # Warnings section markdown += "## Warnings\n\n" if warnings: for i, warning in enumerate(warnings, 1): # Format the file path to be more readable file_path = warning['file'] if file_path.startswith('/Users/dougeric/git/private-aws-neuron-sdk-staging/'): file_path = file_path[len('/Users/dougeric/git/private-aws-neuron-sdk-staging/'):] # Create a more readable header with file and line info if warning['file'] != 'unknown': if warning['line'] != '0': markdown += f"### Warning {i}: {file_path} (line {warning['line']})\n\n" else: markdown += f"### Warning {i}: {file_path}\n\n" else: markdown += f"### Warning {i}\n\n" # Don't include the file path in the message if it's already in the header message = warning['message'] if warning['file'] != 'unknown' and message.startswith(warning['file']): # Remove the file path from the message message = message[len(warning['file'])+2:] # +2 for ": " markdown += f"```\n{message}\n```\n\n" else: markdown += "No warnings found.\n\n" return markdown def main(): """Main function.""" print("Checking Python version...") check_python_version() print("Checking pip installation...") check_pip_installed() print("Finding repository root...") repo_root = find_repo_root() repo_parent = repo_root.parent print(f"Repository found at: {repo_root}") print("Setting up virtual environment...") venv_path = setup_venv(repo_parent) venv_python = get_venv_python(venv_path) venv_pip = get_venv_pip(venv_path) print(f"Changing directory to {repo_root}...") os.chdir(str(repo_root)) print("Installing requirements...") install_requirements(repo_root, venv_pip) print("Running sphinx-build...") start_time = datetime.datetime.now() build_log = run_sphinx_build(repo_root, venv_path) end_time = datetime.datetime.now() build_time = (end_time - start_time).total_seconds() print("Parsing build log...") errors, warnings = parse_build_log(build_log) print("Formatting build log as Markdown...") markdown = format_markdown(errors, warnings, build_time) # Write the formatted log to a file timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") output_file = repo_root / f"build-log-{timestamp}.md" with open(output_file, "w") as f: f.write(markdown) print(f"Build log written to {output_file}") print(f"Found {len(errors)} errors and {len(warnings)} warnings.") if __name__ == "__main__": main() ================================================ FILE: _utilities/inject_archive_meta.py ================================================ #!/usr/bin/env python3 """Inject noindex/nofollow meta directives and deprecation banners into archived .rst files.""" import os import re import sys META_BLOCK = """.. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 """ WARNING_TEMPLATE = """ .. warning:: This document is archived. {framework} is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. """ # Default for backward compatibility WARNING_BLOCK = WARNING_TEMPLATE.format(framework="MXNet") def find_title_end(lines): """Find the line index after the RST title underline. RST titles look like: Title Text ========== or with overline: ========== Title Text ========== Returns the index of the line AFTER the title underline, or -1 if not found. 
""" title_chars = set('=-~^"\'`#*+_.') i = 0 while i < len(lines): line = lines[i].rstrip() # Check if this line is an underline (all same char, at least 3 chars) if len(line) >= 3 and len(set(line)) == 1 and line[0] in title_chars: # Check if previous line is text (title) - this is an underline if i > 0 and lines[i-1].strip() and not (len(set(lines[i-1].rstrip())) == 1 and lines[i-1].rstrip()[0] in title_chars): return i + 1 # Check if next line is text and line after that is underline (overline pattern) if i + 2 < len(lines) and lines[i+1].strip(): next_next = lines[i+2].rstrip() if len(next_next) >= 3 and len(set(next_next)) == 1 and next_next[0] in title_chars: return i + 3 i += 1 return -1 def inject_meta_and_warning(filepath, framework="MXNet"): """Inject meta block at top and warning after title in an RST file.""" with open(filepath, 'r') as f: content = f.read() # Skip if already has noindex meta if ':noindex:' in content: print(f" SKIP (already has meta): {filepath}") return warning_block = WARNING_TEMPLATE.format(framework=framework) lines = content.split('\n') # Separate any leading labels (.. _label:) and blank lines # These need to stay before the meta block label_lines = [] content_start = 0 for i, line in enumerate(lines): stripped = line.strip() if stripped.startswith('.. _') and stripped.endswith(':'): label_lines.append(line) content_start = i + 1 elif stripped == '' and all(l.strip().startswith('.. _') for l in lines[:i] if l.strip()): label_lines.append(line) content_start = i + 1 else: break # Build the content after labels remaining_lines = lines[content_start:] remaining_content = '\n'.join(remaining_lines) # Find title end in remaining content title_end = find_title_end(remaining_lines) if title_end >= 0: # Insert warning after title before_title = '\n'.join(remaining_lines[:title_end]) after_title = '\n'.join(remaining_lines[title_end:]) new_remaining = before_title + '\n' + warning_block + after_title else: # No title found, just add warning at the start of content print(f" WARNING: No title found in {filepath}") new_remaining = warning_block + remaining_content # Reconstruct: labels + meta + content with warning label_section = '\n'.join(label_lines) + '\n' if label_lines else '' new_content = label_section + META_BLOCK + new_remaining # Ensure file ends with newline if not new_content.endswith('\n'): new_content += '\n' with open(filepath, 'w') as f: f.write(new_content) print(f" OK: {filepath}") def main(): import argparse parser = argparse.ArgumentParser(description='Inject archive meta into .rst files') parser.add_argument('archive_dir', nargs='?', default='archive/mxnet-neuron', help='Directory containing .rst files to process') parser.add_argument('--framework', default='MXNet', help='Framework name for the deprecation warning (e.g., MXNet, TensorFlow)') args = parser.parse_args() archive_dir = args.archive_dir framework = args.framework rst_files = [] for root, dirs, files in os.walk(archive_dir): for fname in files: if fname.endswith('.rst'): rst_files.append(os.path.join(root, fname)) rst_files.sort() print(f"Processing {len(rst_files)} .rst files in {archive_dir}:") for filepath in rst_files: inject_meta_and_warning(filepath, framework=framework) print(f"\nDone. 
Processed {len(rst_files)} files.") if __name__ == '__main__': main() ================================================ FILE: _utilities/metadata_schema.yaml ================================================ # Metadata Schema for AWS Neuron SDK Setup Documentation # This schema defines the structured metadata fields used in setup documentation pages metadata_fields: # Core identification fields description: type: string required: true description: "SEO and AI agent description of the page content" example: "Install PyTorch Neuron using AWS Deep Learning AMI on Inf2, Trn1, Trn2, Trn3" keywords: type: array[string] required: true description: "Comma-separated search terms for discoverability" example: "pytorch, neuron, dlami, installation, inf2, trn1, trn2, trn3" date-modified: type: date required: true format: "YYYY-MM-DD" description: "ISO 8601 date of last modification" example: "2026-03-02" content-type: type: enum required: true values: - navigation-hub - framework-setup-hub - installation-guide - troubleshooting - legacy-guide description: "Type of documentation page" # Setup-specific fields framework: type: enum required_for: [installation-guide, framework-setup-hub] values: - pytorch - jax - tensorflow - mxnet description: "ML framework being documented" validation: "Must match parent directory name" instance-types: type: array[enum] required_for: [installation-guide, framework-setup-hub, navigation-hub] values: - inf1 - inf2 - trn1 - trn2 - trn3 description: "Supported AWS instance types" validation: "Cannot mix inf1 with inf2/trn1/trn2/trn3" installation-method: type: enum required_for: [installation-guide] values: - dlami - manual - container description: "Installation approach documented" os: type: array[enum] required_for: [installation-guide] values: - ubuntu-24.04 - ubuntu-22.04 - al2023 - rocky-9 description: "Supported operating systems" python-versions: type: array[string] required: false description: "Supported Python versions" example: "3.10, 3.11, 3.12" status: type: enum required: false values: - current - beta - legacy - deprecated description: "Status of the documented feature/hardware" validation: "Must be 'legacy' when instance-types contains only inf1" # AI agent hints task: type: string required: false description: "Task-based description for AI agents" example: "Install PyTorch on Trn1 using DLAMI" prerequisites: type: array[string] required: false description: "List of required knowledge/resources" estimated-time: type: string required: false description: "Estimated completion time" example: "5 minutes" # Validation Rules validation_rules: - rule: "inf1_separation" description: "inf1 cannot be mixed with inf2, trn1, trn2, or trn3" check: "If 'inf1' in instance-types, then len(instance-types) == 1" error_message: "Cannot mix inf1 with other instance types" - rule: "framework_directory_match" description: "framework metadata must match parent directory" check: "framework value must equal parent directory name" error_message: "Framework '{framework}' does not match directory '{directory}'" - rule: "legacy_status_for_inf1" description: "Pages with only inf1 must have legacy status" check: "If instance-types == ['inf1'], then status == 'legacy'" error_message: "Inf1-only pages must have status: legacy" - rule: "legacy_directory_location" description: "Legacy content must be in legacy-inf1 directory" check: "If status == 'legacy', then path contains '/legacy-inf1/'" warning_message: "Legacy content should be in /setup/legacy-inf1/ directory" - rule: 
"installation_guide_completeness" description: "Installation guides must have complete metadata" check: "If content-type == 'installation-guide', then framework, instance-types, installation-method, and os must be present" error_message: "Installation guide missing required metadata: {missing_fields}" - rule: "content_type_requirements" description: "Each content type has specific required fields" requirements: navigation-hub: [description, keywords, instance-types, content-type] framework-setup-hub: [description, keywords, framework, instance-types, content-type] installation-guide: [description, keywords, framework, instance-types, installation-method, os, content-type] troubleshooting: [description, keywords, content-type] legacy-guide: [description, keywords, instance-types, status, content-type] # Usage Examples examples: installation_guide: description: "Install PyTorch Neuron using AWS DLAMI on Inf2, Trn1, Trn2, Trn3" keywords: "pytorch, neuron, dlami, installation, inf2, trn1, trn2, trn3" framework: "pytorch" instance-types: "inf2, trn1, trn2, trn3" installation-method: "dlami" os: "ubuntu-24.04, ubuntu-22.04, al2023" content-type: "installation-guide" date-modified: "2026-03-02" framework_hub: description: "Install PyTorch for AWS Neuron on Inf2, Trn1, Trn2, Trn3 instances" keywords: "pytorch, neuron, installation, trn1, trn2, trn3, inf2" framework: "pytorch" instance-types: "inf2, trn1, trn2, trn3" content-type: "framework-setup-hub" date-modified: "2026-03-02" legacy_guide: description: "Legacy installation guide for AWS Inferentia 1 (Inf1) instances" keywords: "neuron, inf1, legacy, installation, inferentia" instance-types: "inf1" status: "legacy" content-type: "legacy-guide" date-modified: "2026-03-02" ================================================ FILE: _utilities/migrate_setup_content.py ================================================ #!/usr/bin/env python3 """ Setup Content Migration Script Maps old setup file paths to new framework-first paths and generates a migration report. This script does NOT move files — it produces a report of what references exist and where they should point. 
Usage: python3 _utilities/migrate_setup_content.py [--dry-run] [--fix] Options: --dry-run Show what would be changed without modifying files (default) --fix Apply changes to files """ import argparse import os import re import sys from pathlib import Path # Old path → new path mapping PATH_MAP = { "/setup/torch-neuronx": "/setup/pytorch/index", "/setup/jax-neuronx": "/setup/jax/index", "/setup/tensorflow-neuronx": "/frameworks/tensorflow/index", "/setup/setup-neuronx": "/setup/index", "/setup/setup-neuron": "/setup/index", "/setup/mxnet-neuron": "/archive/mxnet-neuron/index", } # External URL mapping (for hardcoded URLs in tutorials) URL_MAP = { "setup/torch-neuronx.html": "setup/pytorch/index.html", "setup/jax-neuronx.html": "setup/jax/index.html", } # Directories to scan SCAN_DIRS = [ "about-neuron", "frameworks", "libraries", "tools", "compiler", "containers", "devflows", "release-notes", "setup", "nki", "dlami", ] # Directories to skip SKIP_DIRS = {"_build", ".git", "__pycache__", ".venv", "node_modules"} def find_rst_files(base_dir: str) -> list[Path]: """Find all .rst files in scan directories.""" files = [] for scan_dir in SCAN_DIRS: dir_path = Path(base_dir) / scan_dir if dir_path.exists(): for rst_file in dir_path.rglob("*.rst"): if not any(skip in rst_file.parts for skip in SKIP_DIRS): files.append(rst_file) return sorted(files) def find_references(content: str, file_path: Path) -> list[dict]: """Find old setup path references in file content.""" refs = [] # Match :doc: references for old_path, new_path in PATH_MAP.items(): pattern = re.compile( rf":doc:`([^`]*<)?{re.escape(old_path)}(>)?`", re.IGNORECASE ) for match in pattern.finditer(content): line_num = content[: match.start()].count("\n") + 1 refs.append( { "file": str(file_path), "line": line_num, "old": match.group(0), "old_path": old_path, "new_path": new_path, "type": "doc_ref", } ) # Match :ref: references to old labels old_labels = { "setup-torch-neuronx": "pytorch-setup", "setup-jax-neuronx": "jax-setup", "setup-tensorflow-neuronx": "tensorflow-setup", } for old_label, new_label in old_labels.items(): pattern = re.compile(rf":ref:`([^`]*<)?{re.escape(old_label)}(>)?`") for match in pattern.finditer(content): line_num = content[: match.start()].count("\n") + 1 refs.append( { "file": str(file_path), "line": line_num, "old": match.group(0), "old_label": old_label, "new_label": new_label, "type": "ref_label", } ) # Match hardcoded URLs for old_url, new_url in URL_MAP.items(): if old_url in content: line_num = content[: content.index(old_url)].count("\n") + 1 refs.append( { "file": str(file_path), "line": line_num, "old_url": old_url, "new_url": new_url, "type": "url", } ) return refs def apply_fix(file_path: Path, refs: list[dict]) -> bool: """Apply reference fixes to a file.""" content = file_path.read_text() modified = False for ref in refs: if ref["type"] == "doc_ref": old = ref["old_path"] new = ref["new_path"] new_content = content.replace(old, new) if new_content != content: content = new_content modified = True elif ref["type"] == "url": old = ref["old_url"] new = ref["new_url"] new_content = content.replace(old, new) if new_content != content: content = new_content modified = True if modified: file_path.write_text(content) return modified def main(): parser = argparse.ArgumentParser(description="Setup content migration script") parser.add_argument( "--fix", action="store_true", help="Apply changes (default is dry-run)" ) args = parser.parse_args() base_dir = 
os.path.dirname(os.path.dirname(os.path.abspath(__file__))) rst_files = find_rst_files(base_dir) print(f"Scanning {len(rst_files)} .rst files...") print() all_refs = [] for rst_file in rst_files: content = rst_file.read_text() refs = find_references(content, rst_file) all_refs.extend(refs) if not all_refs: print("No old setup references found. Migration complete.") return # Group by file by_file = {} for ref in all_refs: by_file.setdefault(ref["file"], []).append(ref) print(f"Found {len(all_refs)} references in {len(by_file)} files:") print() for file_path, refs in sorted(by_file.items()): print(f" {file_path}:") for ref in refs: if ref["type"] == "doc_ref": print(f" L{ref['line']}: {ref['old_path']} → {ref['new_path']}") elif ref["type"] == "ref_label": print(f" L{ref['line']}: {ref['old_label']} → {ref['new_label']}") elif ref["type"] == "url": print(f" L{ref['line']}: {ref['old_url']} → {ref['new_url']}") print() if args.fix: fixed_count = 0 for file_path, refs in by_file.items(): if apply_fix(Path(file_path), refs): fixed_count += 1 print(f" ✓ Fixed: {file_path}") print(f"\nFixed {fixed_count} files.") else: print("Dry run — no files modified. Use --fix to apply changes.") if __name__ == "__main__": main() ================================================ FILE: _utilities/old-nki-apis.txt ================================================ https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.benchmark.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.profile.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.baremetal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.simulate_kernel.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.sbuf.alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.sbuf.mod_alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.sbuf.auto_alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.psum.alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.psum.mod_alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.psum.auto_alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.skip_middle_end_transformations.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.enable_stack_allocator.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.compiler.force_auto_alloc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.tensor.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.load.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.store.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.load_transpose2d.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.atomic_rmw.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.copy.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.broadcast_to.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.empty_like.html 
https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.zeros_like.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.ones.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.full.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.rand.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.random_seed.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.shared_constant.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.shared_identity_matrix.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.arange.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.mgrid.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.expand_dims.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.where.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.gather_flattened.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.all_reduce.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.par_dim.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.spmd_dim.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.nc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.device_print.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.loop_reduce.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.fp32.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.add.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.subtract.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.multiply.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.divide.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.power.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.maximum.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.minimum.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.max.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.min.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.mean.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.var.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.sum.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.prod.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.all.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.abs.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.negative.html 
https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.sign.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.trunc.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.floor.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.ceil.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.mod.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.fmod.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.exp.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.log.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.cos.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.sin.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.tan.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.tanh.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.arctan.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.sqrt.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.rsqrt.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.sigmoid.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.relu.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.gelu.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.gelu_dx.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.gelu_apprx_tanh.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.gelu_apprx_sigmoid.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.silu.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.silu_dx.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.erf.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.erf_dx.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.softplus.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.mish.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.square.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.softmax.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.rms_norm.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.dropout.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.matmul.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.transpose.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.reciprocal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.bitwise_and.html 
https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.bitwise_or.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.bitwise_xor.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.invert.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.left_shift.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.right_shift.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.equal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.not_equal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.greater.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.greater_equal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.less.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.less_equal.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.logical_and.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.logical_or.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.logical_xor.html https://awsdocs-neuron.readthedocs-hosted.com/en/v2.26.1/nki/api/generated/nki.language.logical_not.html
================================================
FILE: _utilities/setup_jira_token.sh
================================================
#!/bin/bash
# Setup script to fetch Jira API token from AWS Secrets Manager
# and configure it for the Atlassian MCP server

set -e

echo "Setting up Jira API token..."

# Check if AWS CLI is available
if ! command -v aws &> /dev/null; then
    echo "Error: AWS CLI is not installed"
    echo "Install with: brew install awscli"
    exit 1
fi

# Check if ada is available
if ! command -v ada &> /dev/null; then
    echo "Error: ada credentials tool is not installed"
    echo "Install with: toolbox install ada"
    exit 1
fi

# Set AWS profile to kaena
export AWS_PROFILE=kaena

echo "Fetching Jira API token from AWS Secrets Manager..."

# Run the fetch inside the `if` condition so a failure is handled here
# instead of `set -e` aborting before the error message below can print.
if ! JIRA_TOKEN=$(aws secretsmanager get-secret-value \
    --secret-id NKI_JIRA_API_TOKEN \
    --region us-west-2 \
    --query SecretString \
    --output text 2>&1); then
    echo "Error: Failed to fetch Jira API token"
    echo "Make sure you have:"
    echo "  1. Run 'ada credentials setup' with account 621547421844, role Admin, profile kaena"
    echo "  2. Added kaena profile to ~/.aws/config with ada credential_process"
    echo "  3. Have IAM permissions to access the secret"
    echo ""
    echo "Error details:"
    echo "$JIRA_TOKEN"
    exit 1
fi

echo "✓ Successfully fetched Jira API token"

# Update the MCP config with the actual token
MCP_CONFIG="$HOME/.kiro/settings/mcp.json"

if [ ! -f "$MCP_CONFIG" ]; then
    echo "Error: MCP config not found at $MCP_CONFIG"
    exit 1
fi

# Rewrite the MCP config in place with the token substituted
python3 << EOF
import json
import os

config_path = os.path.expanduser('$MCP_CONFIG')

with open(config_path, 'r') as f:
    config = json.load(f)

# Update the Jira API token
if 'atlassian-jira' in config['mcpServers']:
    config['mcpServers']['atlassian-jira']['env']['JIRA_API_TOKEN'] = '''$JIRA_TOKEN'''
    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)
    print("✓ Updated MCP configuration with Jira API token")
else:
    print("Error: atlassian-jira server not found in MCP config")
    exit(1)
EOF

echo ""
echo "Setup complete! You can now use Jira tools in Kiro."
echo ""
echo "To use Jira MCP tools:"
echo "  1. Restart Kiro CLI"
echo "  2. Use Jira tools through the MCP server"
echo ""
echo "Example queries:"
echo "  - Search for NKI tickets"
echo "  - Get ticket details"
echo "  - Create new tickets"
================================================
FILE: about-neuron/amazonq-getstarted.rst
================================================
.. image:: /images/q-logo.png
   :scale: 30%
   :alt: Amazon Q
   :align: left
   :target: https://aws.amazon.com/q/

.. _amazon-q-dev:

Ask Amazon AI helper tools
===========================

Use Kiro, Quick, and Amazon Q in the AWS console as your Neuron experts for general Neuron technical guidance and to jumpstart your NKI kernel development.

.. card:: Ask Q on AWS apps and websites
   :link: https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-on-aws.html

.. card:: Ask Kiro IDE
   :link: https://kiro.dev/

.. card:: Ask Kiro CLI
   :link: https://kiro.dev/cli

.. card:: Ask Quick
   :link: https://aws.amazon.com/quick/

.. card:: Guidelines for Quality Results
   :link: amazon-q-dev-guidelines
   :link-type: ref

.. _amazon-q-dev-guidelines:

Guidelines for Quality Results
------------------------------

1. Be Specific: Clearly state the task, desired output, and any constraints.
2. Provide Context: Mention specific versions, strategies, and any relevant performance requirements.
3. Request Complete Code: Ask for full implementations including imports, decorators, and main functions. Remember to always review and test the generated code before using it in production.
4. Ask for Explanations: Request comments or separate explanations for complex parts of the code.
5. Iterate: If the initial response isn’t satisfactory, refine your prompt based on the output. If you encounter issues or inaccuracies, consider rephrasing your prompt or breaking down complex tasks into smaller, more specific questions.
6. Fact check: Use Q as a starting point and supplement its output with official documentation, the AWS NKI Samples repository, and your own expertise.

Example Prompts
~~~~~~~~~~~~~~~~~

.. note:: Amazon AI helper tools may not be fully synced with the latest Neuron features. Therefore, they may not always produce optimal or fully accurate results.

1. “Explain the key features and benefits of AWS Neuron Kernel Interface (NKI).”
2. "How do different parallelism strategies (data, pipeline, tensor) affect training performance on Neuron?"
3. “What are the best practices for optimizing matrix multiplication operations using Neuron Kernel Interface (NKI)?”
4. “Provide complete Neuron Kernel Interface (NKI) code for a matrix multiplication kernel, including imports, decorators, and explanations of key optimizations.
   Focus on efficient tiling and data movement strategies.”
================================================
FILE: about-neuron/announcements/index.rst
================================================
.. _announcements-main:

Announcements
=============

This page will be replaced by ABlog. It's here to make sure it's in the TOC.
================================================
FILE: about-neuron/announcements/neuron1.x/announce-eol-mx-before-1-5.rst
================================================
.. post:: May 01, 2023 01:00
   :language: en
   :tags: announce-eol mxnet-neuron

.. _announce-eol-mxnet-before-1-5:

Announcing end of support for ``mxnet-neuron`` version 1.5
-----------------------------------------------------------

:ref:`Neuron release 2.10 ` will be the last release to include ``mxnet-neuron`` version 1.5. Future Neuron releases will not include ``mxnet-neuron`` version 1.5. Current users of this version are advised to migrate to the latest ``mxnet-neuron`` version.
================================================
FILE: about-neuron/announcements/neuron1.x/announce-eol-pt-1-5.rst
================================================
.. post:: Mar 25, 2022
   :language: en
   :tags: announce-eol torch-neuron

.. _announce-eol-pt-1-5:

Announcing end of support for torch-neuron version 1.5 starting with Neuron 1.19.0 release
------------------------------------------------------------------------------------------

Starting with the *Neuron 1.19.0* release, *torch-neuron version 1.5* will no longer be supported. The last release of *torch-neuron version 1.5* will be issued as part of the *Neuron 1.18.0* release. Current users of this version are advised to migrate to the latest *torch-neuron* version.
================================================
FILE: about-neuron/announcements/neuron1.x/announce-eol-pt-before-1-8.rst
================================================
.. post:: Nov 22, 2022
   :language: en
   :tags: announce-eol torch-neuron

.. _announce-eol-pt-before-1-8:

Announcing end of support for ``torch-neuron`` versions 1.7 and 1.8
-------------------------------------------------------------------

:ref:`Neuron release 2.5 ` will be the last release to include ``torch-neuron`` versions 1.7 and 1.8. Future Neuron releases will not include ``torch-neuron`` versions 1.7 and 1.8. Current users of those versions are advised to migrate to the latest ``torch-neuron`` version.
================================================
FILE: about-neuron/announcements/neuron1.x/announce-eol-tf-before-2-5.rst
================================================
.. post:: Nov 22, 2022 01:00
   :language: en
   :tags: announce-eol tensorflow-neuron

.. _announce-eol-tf-before-2-5:

Announcing end of support for ``tensorflow-neuron`` versions 2.5 and 2.6
------------------------------------------------------------------------

:ref:`Neuron release 2.5 ` will be the last release to include ``tensorflow-neuron`` versions 2.5 and 2.6. Future Neuron releases will not include ``tensorflow-neuron`` versions 2.5 and 2.6. Current users of those versions are advised to migrate to the latest ``tensorflow-neuron`` version.
================================================
FILE: about-neuron/announcements/neuron1.x/announce-eol-tf-before-2-7.rst
================================================
.. post:: May 01, 2023 01:00
   :language: en
   :tags: announce-eol tensorflow-neuron

.. _announce-eol-tf-before-2-7:

Announcing end of support for ``tensorflow-neuron`` version 2.7
----------------------------------------------------------------

:ref:`Neuron release 2.10 ` will be the last release to include ``tensorflow-neuron`` version 2.7. Future Neuron releases will not include ``tensorflow-neuron`` version 2.7. Current users of this version are advised to migrate to the latest ``tensorflow-neuron`` version.
================================================
FILE: about-neuron/announcements/neuron1.x/announcements.rst
================================================
.. post:: Feb 17, 2022
   :language: en
   :tags: announcements

.. _prev-announcements:

Previous Announcements
======================

.. contents:: Table of contents
   :local:
   :depth: 1

.. _maintenance_tf21_tf24:

02/17/2022 - tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4 enter maintenance mode
------------------------------------------------------------------------------------

Starting with the *Neuron 1.17.2* release, *tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4* are entering maintenance mode. Future releases of *tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4* will address critical security issues only. Current users of those versions are advised to migrate to the latest *tensorflow-neuron* version.

10/27/2021 - Introducing Neuron Runtime 2.x (libnrt.so)
-------------------------------------------------------

Starting with the *Neuron 1.16.0* release, *Neuron Runtime 1.x* (``neuron-rtd``) is entering maintenance mode and is replaced by *Neuron Runtime 2.x*, a shared library named ``libnrt.so``. For more information on Runtime 1.x, see :ref:`Neuron Runtime 1.x enters maintenance mode `. For more information, please see :ref:`introduce-libnrt`.

.. _maintenance_rtd:

10/27/2021 - Neuron Runtime 1.x (``neuron-rtd``) enters maintenance mode
------------------------------------------------------------------------

Starting with the *Neuron 1.16.0* release, *Neuron Runtime 1.x* (``neuron-rtd``) is entering maintenance mode and is replaced by *Neuron Runtime 2.x*, a shared library named ``libnrt.so``. Future releases of *Neuron Runtime 1.x* (``neuron-rtd``) will address critical bug fixes and security issues only. Previous releases of *Neuron Runtime 1.x* (``neuron-rtd``) will continue to be available via ``rpm`` and ``deb`` packages.

For more information please see:

* :ref:`introduce-libnrt`
* :ref:`install-guide-index`
* :ref:`neuron-maintenance-policy`

.. _maintenance_mxnet_1_5:

10/27/2021 - Neuron support for *Apache MXNet 1.5* enters maintenance mode
--------------------------------------------------------------------------

Starting with *Neuron release 1.16.0*, Neuron support for *MXNet 1.5* is entering maintenance mode. Future releases of Neuron supporting *MXNet 1.5* will address critical bug fixes and security issues only. Previous releases of *Apache MXNet 1.5* will continue to be available via ``pip`` packages. Current users of *MXNet Neuron 1.5* can migrate their applications to *MXNet Neuron 1.8*. For more information about MXNet Neuron support and how to upgrade to the latest *MXNet Neuron 1.8*, please see :ref:`neuron-mxnet`.

.. _maintenance_neuron-cli:

10/27/2021 - ``neuron-cli`` enters maintenance mode
---------------------------------------------------

Starting with *Neuron release 1.16.0*, with the introduction of *Neuron Runtime 2.x*, ``neuron-cli`` is entering maintenance mode. ``neuron-cli`` functionality will be available only if *Neuron Runtime 1.x* (``neuron-rtd``) is being used by the application.
If the application is using the *Neuron Runtime 2.x* shared library (``libnrt.so``), ``neuron-cli`` functionality will not be available. If you have used ``neuron-cli`` in previous releases, and you are migrating to newer Neuron releases where applications require the *Neuron Runtime 2.x* shared library, please see the :ref:`neuron-cli-mntnce-faq` below. Future releases of ``neuron-cli`` will address critical bug fixes and security issues only. Previous releases of ``neuron-cli`` will continue to be available via ``rpm`` and ``deb`` packages.

.. _eol-ncg:

10/27/2021 - End of support for NeuronCore Groups (NCG)
-------------------------------------------------------

Before the introduction of *Neuron Runtime 2.x*, NeuronCore Group (NCG) was used by Neuron Runtime 1.x to define an execution group of one or more NeuronCores where models can be loaded and executed. It also provided separation between processes.

With the introduction of *Neuron Runtime 2.x*, the strict separation of NeuronCores into groups is no longer needed, and NeuronCore Groups (NCG) is deprecated. *Neuron Runtime 2.x* enables each process to own a set of NeuronCores, and within each process, Neuron Runtime 2.x supports loading and executing multiple models on separate, different, or overlapping sets of NeuronCores.

Please note that the ``NEURONCORE_GROUP_SIZES`` environment variable is in the process of being :ref:`unsupported `, and for a transition period ``NEURONCORE_GROUP_SIZES`` can be used to preserve the old NeuronCore Group behavior. The frameworks internally convert ``NEURONCORE_GROUP_SIZES`` to use the runtime's new mode of mapping models to NeuronCores. For more information, see the details about ``NEURON_RT_VISIBLE_CORES`` at :ref:`nrt-configuration` and :ref:`neuron-migrating-apps-neuron-to-libnrt`.

.. _eol-ncgs-env:

10/27/2021 - Announcing end of support for ``NEURONCORE_GROUP_SIZES``
---------------------------------------------------------------------

The ``NEURONCORE_GROUP_SIZES`` environment variable is in the process of being deprecated; future Neuron releases may no longer support it. Please start using ``NEURON_RT_VISIBLE_CORES`` instead. See :ref:`eol-ncg`, :ref:`nrt-configuration` and :ref:`neuron-migrating-apps-neuron-to-libnrt` for more information.
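For example, where an application previously relied on ``NEURONCORE_GROUP_SIZES``, it can instead be launched with an explicit core range. A minimal sketch (``myapp`` is an illustrative placeholder for your application, and the core counts are arbitrary):

.. code-block:: bash

   # Deprecated: implicit grouping via NEURONCORE_GROUP_SIZES
   # NEURONCORE_GROUP_SIZES=2,2 myapp

   # Preferred: give this process NeuronCores 0 through 3 explicitly
   NEURON_RT_VISIBLE_CORES=0-3 myapp

   # Alternatively, request a number of cores and let the runtime pick them
   NEURON_RT_NUM_CORES=4 myapp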

.. _neuron-cli-mntnce-faq:

Frequently Asked Questions (FAQ)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Is there another tool that provides the same functionality as ``neuron-cli list-model``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes, please see :ref:`neuron-ls-ug` or :ref:`neuron-monitor-ug`.

Is there another tool that provides the same functionality as ``neuron-cli create-ncg``, ``neuron-cli destroy-ncg``, and ``neuron-cli list-ncg``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, these functionalities are no longer needed with *Neuron Runtime 2.x*. NeuronCore Groups (NCG) :ref:`is deprecated ` and the ``NEURONCORE_GROUP_SIZES`` environment variable :ref:`is in the process of being deprecated `. Please start using ``NEURON_RT_VISIBLE_CORES`` instead. See :ref:`nrt-configuration` and :ref:`neuron-migrating-apps-neuron-to-libnrt` for more information.

Is there another tool that provides the same functionality as ``neuron-cli reset``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, this functionality is no longer needed with *Neuron Runtime 2.x*. Before the introduction of ``libnrt.so``, in certain cases after an application crashed, models had to be unloaded manually by calling ``neuron-cli reset``. With ``libnrt.so``, applications run in the context of the ``libnrt.so`` shared library, and when an application exits, the Neuron driver frees all resources associated with the application.

For more information please see:

* :ref:`introduce-libnrt`
* :ref:`neuron-tools`
* :ref:`install-guide-index`
* :ref:`neuron-maintenance-policy`

.. _eol-conda-packages:

05/28/2021 - End of support for Neuron Conda packages in Deep Learning AMI starting Neuron 1.14.0
-------------------------------------------------------------------------------------------------

Starting with Neuron SDK 1.14.0, we will no longer support conda packages to install the Neuron SDK framework in DLAMI, and we will no longer update the conda packages used to install the Neuron SDK framework (Neuron conda packages) with new versions. Starting with Neuron SDK 1.14.0, pip packages (Neuron pip packages) will be used to install the Neuron SDK framework in the DLAMI conda environment. To upgrade the Neuron SDK framework, DLAMI users should use pip upgrade commands instead of conda update commands. Instructions are available in this blog and in Neuron SDK documentation (:ref:`setup-guide-index`).

Starting with Neuron SDK 1.14.0, run one of the following commands to upgrade to the latest Neuron framework of your choice:

* To upgrade PyTorch Neuron:

  .. code-block::

     source activate aws_neuron_pytorch_p36
     pip install --upgrade torch-neuron neuron-cc[tensorflow] torchvision --extra-index-url https://pip.repos.neuron.amazonaws.com

* To upgrade TensorFlow Neuron:

  .. code-block::

     source activate aws_neuron_tensorflow_p36
     pip install --upgrade tensorflow-neuron neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com

* To upgrade MXNet Neuron:

  .. code-block::

     source activate aws_neuron_mxnet_p36
     pip install --upgrade mxnet-neuron neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com

For more information please check the `blog `__.

.. _eol-ubuntu16:

05/01/2021 - End of support for Ubuntu 16 starting Neuron 1.14.0
----------------------------------------------------------------

Ubuntu 16.04 officially entered its end-of-life phase in April 2021 (see https://ubuntu.com/about/release-cycle) and will not receive any public software or security updates. Starting with Neuron SDK 1.14.0, Ubuntu 16 is no longer supported for Neuron; users who are using Ubuntu 16 are requested to migrate to Ubuntu 18 or Amazon Linux 2. Customers who choose to upgrade libc on Ubuntu 16 to work with Neuron v1.13.0 (or higher versions) are highly discouraged from doing so, since Ubuntu 16 will no longer receive public security updates.

.. _eol-classic-tensorboard:

05/01/2021 - End of support for classic TensorBoard-Neuron starting Neuron 1.13.0 and introducing Neuron Plugin for TensorBoard
-------------------------------------------------------------------------------------------------------------------------------

Starting with Neuron SDK 1.13.0, we are introducing the :ref:`Neuron Plugin for TensorBoard ` and we will no longer support classic TensorBoard-Neuron. Users are required to migrate to the Neuron Plugin for TensorBoard. Starting with Neuron SDK 1.13.0, if you are using TensorFlow-Neuron within a DLAMI Conda environment, attempting to run ``tensorboard`` with the existing version of TensorBoard will fail.
Please update the TensorBoard version before installing the Neuron plugin by running ``pip install TensorBoard --force-reinstall``; for installation instructions, see :ref:`neuron-plugin-tensorboard`. Users who are using Neuron SDK releases before 1.13.0 can find classic TensorBoard-Neuron documentation at `Neuron 1.12.2 documentation `__. For more information, see :ref:`neuron-tensorboard-rn` and :ref:`neuron-plugin-tensorboard`.

.. _eol_python_3_5:

02/24/2021 - End of support for Python 3.5
-------------------------------------------

As Python 3.5 reached end-of-life in October 2020, and many packages including TorchVision and Transformers have stopped supporting Python 3.5, we will begin to stop supporting Python 3.5 for frameworks, starting with PyTorch-Neuron version :ref:`neuron-torch-11170` in this release. You can continue to use older versions with Python 3.5.

11/17/2020 - End of support for ONNX
------------------------------------

ONNX support is limited, and from this version onwards we are not planning to add any additional capabilities to ONNX. We recommend running models in TensorFlow, PyTorch or MXNet for best performance and support.

07/16/2020 - End of support for PyTorch 1.3
--------------------------------------------

Starting with this release, we are ending support for PyTorch 1.3 and migrating to PyTorch 1.5.1. Customers are advised to migrate to PyTorch 1.5.1.
================================================
FILE: about-neuron/announcements/neuron1.x/eol-ncgs-env_2.rst
================================================
.. post:: Mar 25, 2022
   :language: en
   :tags: announce-eol

Announcing end of support for ``NEURONCORE_GROUP_SIZES`` starting with Neuron 1.20.0 release
--------------------------------------------------------------------------------------------

Starting with Neuron SDK 1.20.0, the ``NEURONCORE_GROUP_SIZES`` environment variable will no longer be supported. Setting the ``NEURONCORE_GROUP_SIZES`` environment variable will no longer affect application behavior. Current customers using the ``NEURONCORE_GROUP_SIZES`` environment variable are advised to use the ``NEURON_RT_VISIBLE_CORES`` or ``NEURON_RT_NUM_CORES`` environment variable instead. See :ref:`eol-ncg`, :ref:`nrt-configuration` and :ref:`neuron-migrating-apps-neuron-to-libnrt` for more information.
================================================
FILE: about-neuron/announcements/neuron1.x/eol-pt-15.rst
================================================
.. post:: Apr 29, 2022
   :language: en
   :tags: eol

.. _eol-pt-15:

End of support for torch-neuron version 1.5
-------------------------------------------

Starting with the *Neuron 1.19.0* release, *torch-neuron 1.5* will no longer be supported, and no further releases of *torch-neuron version 1.5* will be issued. Current users of torch-neuron version 1.5 are advised to migrate to the latest *torch-neuron* version.
================================================
FILE: about-neuron/announcements/neuron1.x/eol-tf-21-24.rst
================================================
.. post:: Mar 25, 2022
   :language: en
   :tags: eol

.. _eol-tf-21-24:

End of support for tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4
--------------------------------------------------------------------

Starting with the *Neuron 1.18.0* release, *tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4* will no longer be supported, and no further releases of *tensorflow-neuron versions 2.1, 2.2, 2.3 and 2.4* will be issued.
Current users of those versions are advised to migrate to the latest *tensorflow-neuron* version.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-component-change.rst
================================================
.. post:: December 21, 2023
   :language: en
   :tags: announce-name-change, neuron-component

.. _announce-component-name-change:

Announcing Name Change for Neuron Components
---------------------------------------------

Starting with :ref:`Neuron release 2.16 `, the names of the following Neuron components will change as follows:

======================= =================== ====================
Package name            Current Name        New Name
======================= =================== ====================
torch-neuronx           PyTorch Neuron      PyTorch NeuronX
tensorflow-neuronx      TensorFlow Neuron   TensorFlow NeuronX
neuronx-cc              Neuron Compiler     NeuronX Compiler
aws-neuronx-runtime-lib Neuron Runtime      NeuronX Runtime
transformers-neuronx    Transformers Neuron Transformers NeuronX
neuronx-distributed     Neuron Distributed  NeuronX Distributed
======================= =================== ====================
================================================
FILE: about-neuron/announcements/neuron2.x/announce-correction-neuron-driver-support-inf1.rst
================================================
.. post:: March 12, 2026
   :language: en
   :tags: announce-correction-neuron-driver-inf1, neuron-driver-version, inf1

.. _announce-correction-neuron-driver-inf1-support:

Correction: Neuron Driver support for Inf1 — version 2.24 (not 2.21)
---------------------------------------------------------------------

We are correcting a previous announcement regarding the last Neuron Driver version to support Inf1. The last supported version is 2.24. Neuron driver versions above 2.24 only support non-Inf1 instances (such as ``Trn1``, ``Inf2``, or other instance types). For ``Inf1`` instance users, only Neuron driver version 2.24 will remain supported with regular security patches. As part of this correction, Neuron Driver version **2.24.13.0** has been released as a patch for ``Inf1`` users, adding compatibility with Linux kernel 6.18.

``Inf1`` instance users are advised to pin the Neuron driver version to ``2.24.*`` in their installation script:

For Ubuntu:

.. code-block:: bash

   sudo apt-get install aws-neuronx-dkms=2.24.* -y

For Amazon Linux 2 / Amazon Linux 2023:

.. code-block:: bash

   sudo yum install aws-neuronx-dkms-2.24.* -y

Refer to the :ref:`Neuron Driver release notes ` for more details.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-deprecation-containers-rtd.rst
================================================
.. post:: December 20, 2023
   :language: en
   :tags: announce-deprecating-containers, runtime-rtd

.. _announce-update-containers:

Announcing end-of-support for Neuron Containers with Runtime 1.x
-----------------------------------------------------------------

:ref:`Neuron release 2.3 ` was the last release to support Neuron Runtime 1.x (neuron-rtd). Current users of Neuron DLC/DLAMI with Neuron Runtime 1.x are required to :ref:`update their image ` to support the latest Neuron Runtime versions. For instructions, see the :ref:`Setup Guide `.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-deprecation-nxd-path-trace-api.rst
================================================
.. post:: September 18, 2025
   :language: en
   :tags: announce-deprecation-nxd-path-trace-api, al2

.. _announce-deprecation-nxd-path-trace-api:

Announcing the deprecation of the NeuronX Deep Learning Inference API path_trace function
-------------------------------------------------------------------------------------------

:ref:`Neuron release 2.26.0 ` is the last release supporting ``parallel_model_trace``. This NxD Inference function will be deprecated in the next version of the Neuron SDK in favor of the ``ModelBuilder.trace()`` method, which provides a more robust and flexible approach for tracing and compiling models for Neuron devices, enabling more advanced features such as weight layout optimization support, as well as other quality-of-life and stability improvements for SPMD tracing.

Customers directly invoking ``parallel_model_trace`` can now use the ModelBuilderV2 APIs. For more details on these APIs, see :ref:`nxd-core-model-builder-v2`. Customers directly using models in NxDI are not impacted, since NxDI models are already built on MBv1, which is unaffected by this change.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-deprecation-transformer-flag.rst
================================================
.. post:: September 15, 2023
   :language: en
   :tags: announce-end-of-support, transformer-flag

.. _announce-end-of-support-transformer-flag:

Announcing end-of-support for ``--model-type=transformer-inference`` compiler flag
-----------------------------------------------------------------------------------

Starting with :ref:`Neuron release 2.14 `, the ``--model-type=transformer-inference`` compiler flag is deprecated. Neuron SDK users using the ``--model-type=transformer-inference`` compiler flag are highly encouraged to migrate to the ``--model-type=transformer`` compiler flag.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eol-megatron-lm.rst
================================================
.. post:: Aug 8, 2023
   :language: en
   :tags: announce-eol, trn1, trn1n

.. _announce-eol-megatronlm:

Announcing end of support for AWS Neuron reference for Megatron-LM
-------------------------------------------------------------------

:ref:`Neuron release 2.12 ` will be the last release to include support for `AWS Neuron reference for Megatron-LM `_. Future releases will not include Neuron support for Megatron-LM. Current Neuron Megatron-LM users are advised to migrate to `AWS Neuron reference for NeMo Megatron `_ or `Neuron Distributed `_.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eol-python-3-7.rst
================================================
.. post:: Jul 26, 2023 10:00
   :language: en
   :tags: announce-eol, python37

.. _announce-eol-python37:

Announcing end of support for ``Python 3.7``
---------------------------------------------

:ref:`Neuron release 2.12 ` will be the last release to include support for ``Python 3.7``. Future Neuron releases will not include support for ``Python 3.7``. Current users of ``Python 3.7`` are advised to migrate to the latest supported Python version (``Python 3.10``).
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eol-ubuntu-18.rst
================================================
.. post:: Jul 13, 2023 11:00
   :language: en
   :tags: announce-eol, ubuntu18

.. _announce-eol-ubuntu18:

Announcing end of support for ``Ubuntu 18``
-------------------------------------------

:ref:`Neuron release 2.12 ` will be the last release to include support for ``Ubuntu 18``. Future Neuron releases will not include support for ``Ubuntu 18``. Current users of ``Ubuntu 18`` are advised to migrate to ``Ubuntu 20``.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-al2.rst
================================================
.. post:: June 28, 2024
   :language: en
   :tags: announce-eos-al2, al2

.. _announce-eos-al2:

Announcing end of support for Neuron Runtime support of Amazon Linux 2 (AL2)
------------------------------------------------------------------------------

:ref:`Neuron release 2.19 ` will be the last release to include Neuron Runtime support for ``Amazon Linux 2``. Future Neuron releases will not include Neuron Runtime support for ``Amazon Linux 2``. Current users of ``Amazon Linux 2`` are advised to migrate to Amazon Linux 2023 (AL2023) or Ubuntu 20/22.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-beta-pytorch-neuroncore-placement-apis.rst
================================================
.. post:: June 24, 2025
   :language: en
   :tags: announce-no-longer-support-pytorch-neuroncore-placement

.. _announce-no-longer-support-beta-pytorch-neuroncore-placement-apis:

Announcing end of support for Beta PyTorch NeuronCore Placement APIs starting next release
--------------------------------------------------------------------------------------------

:ref:`Neuron Release 2.24 ` is the last release to support the Beta PyTorch NeuronCore Placement APIs. Customers using the Beta PyTorch NeuronCore Placement APIs are recommended to migrate to the generally available (GA) PyTorch NeuronCore Placement APIs. Please refer to the :ref:`PyTorch Neuron documentation ` for guidance on using the supported functionality. Any models using the beta APIs will need to be updated to use the generally available APIs.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-bf16-vars.rst
================================================
.. post:: June 24, 2025
   :language: en
   :tags: announce-no-longer-support-xla-env-vars

.. _announce-eos-longer-support-xla-bf16-vars:

Announcing end of support for XLA_USE_BF16 and XLA_DOWNCAST_BF16 environment variables starting next release
---------------------------------------------------------------------------------------------------------------

:ref:`Neuron Release 2.24 ` will be the last release to support the following environment variables:

- XLA_USE_BF16
- XLA_DOWNCAST_BF16

**I currently utilize these environment variables in my model code. What do I do?**

Customers are recommended to migrate to automatic mixed precision or use ``model.to(torch.bfloat16)`` to convert their model to the BF16 format. For detailed migration guidance, please refer to :ref:`migration_from_xla_downcast_bf16`.
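For instance, instead of exporting ``XLA_USE_BF16=1`` before launching a run, the conversion can be done directly in model code. A minimal sketch (the ``torch.nn.Linear`` layer stands in for your own module and is purely illustrative):

.. code-block:: python

   import torch

   # Stand-in for your model; any torch.nn.Module works the same way
   model = torch.nn.Linear(128, 128)

   # Convert parameters and buffers to bfloat16 in code, rather than
   # relying on the deprecated XLA_USE_BF16 / XLA_DOWNCAST_BF16 variables
   model = model.to(torch.bfloat16)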
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-block-dimension-nki.rst
================================================
.. post:: June 24, 2025
   :language: en
   :tags: announce-eos-block-dimension-nki

.. _announce-eos-block-dimension-nki:

Announcing end of support for NKI block dimension starting next release
--------------------------------------------------------------------------

:ref:`Neuron release 2.24 ` will be the last release to include support for the NKI block dimension in NKI tensor creation routines. Starting with this release, using the block dimension will generate EOS warnings. In the next release (Neuron Release 2.25), these warnings will be upgraded to errors. Customers are recommended to refer to the :ref:`nki_block_dimension_migration_guide` for detailed instructions on updating their code.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-dlami-ubuntu-22-04.rst
================================================
.. post:: December 18, 2025
   :language: en
   :tags: announce-eos-dlami-ubuntu-22-04

.. _announce-eos-dlami-ubuntu-22-04:

Announcing End of Support for Ubuntu 22.04 single framework DLAMIs for PyTorch and JAX in a future release
==========================================================================================================

Ubuntu 22.04 single framework DLAMIs for PyTorch and JAX will reach end of support in a future release. Customers are advised to use multi-framework or previously released DLAMIs for Ubuntu 22.04.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-dlami.rst
================================================
.. post:: April 24, 2024
   :language: en
   :tags: announce-eos-dlami, neuron-dlami

.. _announce-eos-dlami:

Announcing end of support for Neuron Release 2.18.0 Deep Learning AMIs
------------------------------------------------------------------------

We are announcing end of support for :ref:`Neuron release 2.18.0 ` Deep Learning AMIs. DLAMIs released between March 26, 2024 (2024-03-26) and April 10, 2024 (2024-04-10) were shipped without the audit package. The following are the affected DLAMIs:

- Deep Learning AMI Neuron (Ubuntu 22.04) 20240401
- Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) 20240328
- Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) 20240402
- Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) 20240409
- Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20240328
- Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20240402
- Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20240409
- Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2) 20240328
- Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2) 20240402
- Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2) 20240409
- Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04) 20240328
- Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04) 20240402
- Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04) 20240409
- Deep Learning Base Neuron AMI (Amazon Linux 2) 20240401
- Deep Learning Base Neuron AMI (Amazon Linux 2) 20240408
- Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240401
- Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240408

Current users of the above :ref:`Neuron release 2.18 ` Deep Learning AMIs are required to upgrade to the latest DLAMIs in order to obtain images with the audit package installed. For instructions to upgrade to the latest AMI, see the :ref:`DLAMI User Guide ` or find the specific DLAMI image ID for the latest Neuron release with :ref:`SSM parameters `.
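For example, the latest DLAMI image ID can be retrieved from AWS Systems Manager with a query along the following lines (the parameter path shown is illustrative of the multi-framework Ubuntu 22.04 DLAMI; check the SSM parameters documentation for the exact paths that apply to your OS and framework):

.. code-block:: bash

   # Query SSM for the AMI ID of the latest Neuron multi-framework DLAMI
   aws ssm get-parameter \
       --region us-east-1 \
       --name /aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id \
       --query "Parameter.Value" \
       --output text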
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-inf1-virtual-environments.rst
================================================
.. post:: December 18, 2025
   :language: en
   :tags: announce-eos-inf1-virtual-environments

.. _announce-eos-inf1-virtual-environments:

Neuron no longer supports Inf1 virtual environments and AMIs starting with Neuron 2.27
======================================================================================

Starting with Neuron release 2.27, Neuron no longer supports Inf1 virtual environments and AMIs. If you are a customer who is currently using Inf1 virtual environments or AMIs, use Neuron DLAMIs with Neuron version 2.26.1 or earlier.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-jax-neuronx-nki-call.rst
================================================
.. post:: April 3, 2025
   :language: en
   :tags: announce-eos-jax-neuronx-features

.. _announce-eos-jax-neuronx-features-2:

Announcing end of support for the ``jax_neuronx.nki_call`` API in ``jax-neuronx`` starting next release
------------------------------------------------------------------------------------------------------------

Starting with :ref:`Neuron Release 2.23 `, Neuron will end support for the ``jax_neuronx.nki_call`` API in the ``jax-neuronx`` package. For a full list of features that require ``jax-neuronx``, please see :ref:`jax-neuron-known-issues`. Customers using the ``jax_neuronx.nki_call`` API are recommended to switch invocations to directly call functions annotated with ``@nki.jit``.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-megatronlm-2-13.rst
================================================
.. post:: Aug 28, 2023
   :language: en
   :tags: announce-eos, trn1, trn1n

.. _announce-eos-megatronlm:

AWS Neuron reference for Megatron-LM no longer supported
----------------------------------------------------------

:ref:`Neuron release 2.13 ` no longer includes support for `AWS Neuron reference for Megatron-LM `_. Current Neuron Megatron-LM users are required to migrate to `AWS Neuron reference for NeMo Megatron `_ or `Neuron Distributed `_.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-mllama-checkpoint.rst
================================================
.. post:: May 15, 2025
   :language: en
   :tags: announce-eos-mllama-checkpoint

.. _announce-eos-mllama-checkpoint:

Announcing end of support for mllama 3.2 Meta Checkpoint API starting next release
--------------------------------------------------------------------------------------

:ref:`Neuron Release 2.23 ` will be the last release to include support for the mllama 3.2 Meta checkpoint API. In the next release (Neuron 2.24), Neuron will end support. All previously converted checkpoints will continue to function without disruption, and customers' existing workflows and converted models remain fully operational. For new checkpoint conversions, the HuggingFace solution provides equivalent functionality. Customers are recommended to use HuggingFace's official conversion script, available here: `Hugging Face Conversion Script `_
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-multiframework-dlamis-inf1.rst
================================================
.. post:: April 24, 2024
   :language: en
   :tags: announce-eos-dlamis-inf1, dlami-inf1

.. _announce-update-multiframework-dlami:

Announcing end of support for Neuron virtual environments in AWS Deep Learning AMI (Amazon Linux 2)
----------------------------------------------------------------------------------------------------

:ref:`Neuron release 2.18.2 ` will be the last release to include support for the following virtual environments in AWS Deep Learning AMI (Amazon Linux 2):

- ``aws_neuron_pytorch_p38: PyTorch 1.13, Python 3.8``
- ``aws_neuron_tensorflow2_p38: TensorFlow 2.10, Python 3.8``

Future releases will not include Neuron support for these virtual environments. Current users of Neuron virtual environments in `AWS Deep Learning AMI (Amazon Linux 2) `_ are required to migrate to the `Neuron multi-framework DLAMI `_. To see a list of Neuron supported virtual environments, please refer to the :ref:`Neuron Multi Framework DLAMI User Guide `.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-nemo.rst
================================================
.. post:: April 3, 2025
   :language: en
   :tags: announce-eos-nemo-megatron

.. _announce-eos-nnm:

Announcing end of Neuron support for NeMo Megatron starting next release
-------------------------------------------------------------------------

Starting with Neuron Release 2.23, Neuron will end support for :ref:`NeMo Megatron `. We recommend that all users of :ref:`NeMo Megatron ` migrate their training workloads to :ref:`NxD Training `. Please refer to the :ref:`Neuron NeMo Megatron to NeuronX Distributed Training Migration Guide ` for guidance.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-neuron-det.rst
================================================
.. post:: December 20, 2024
   :language: en
   :tags: announce-eos-neuron-det

.. _announce-eos-neuron-det:

Announcing end of support for Neuron DET tool starting next release
-------------------------------------------------------------------

:ref:`Neuron Release 2.21 ` will be the last release to support the Neuron Distributed Event Tracing (NDET/neuron-det) tool. We recommend that all customers using the NDET tool for debugging runtime hangs/issues in large-scale settings transition to Neuron Profiler 2.0. This tool offers the same runtime function-level traces with improved ease of use and optimized performance. For more information on Neuron Profiler 2.0, please refer to the :ref:`neuron-profiler-2-0-guide`.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-neuron-driver-support-inf1.rst
================================================
.. post:: June 24, 2025
   :language: en
   :tags: announce-eos-neuron-driver-2.21-version, neuron-driver-version, inf1

.. _announce-upcoming-neuron-driver-2.21-version support changes for inf1 instance:

Upcoming changes to Neuron driver 2.21 support for Inf1 starting Neuron 2.26 release
------------------------------------------------------------------------------------

.. note:: This announcement has been superseded. The correct last supported Neuron driver version for ``Inf1`` is **2.24**, not 2.21. See :ref:`announce-correction-neuron-driver-inf1-support` for details.

Starting with Neuron Release 2.26, Neuron driver versions above 2.21 will only support non-Inf1 instances (such as ``Trn1``, ``Inf2``, or other instance types). For ``Inf1`` instance users, Neuron driver versions 2.21 and below will remain supported with regular security patches.
``Inf1`` instance users are advised to pin the Neuron driver version to ``2.21.*`` in their installation script. Refer to the :ref:`Neuron Driver release [2.22.2.0] ` for detailed instructions on pinning the Neuron Driver. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler-2.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-eos-neuron-profiler .. _announce-eos-neuron-profiler-2: Neuron Explorer Replaces Neuron Profiler, Starting with Neuron 2.29 ------------------------------------------------------------------- Starting with Neuron 2.29, **Neuron Profiler and Profiler 2.0 (UI and CLI) will reach end of support** and be replaced by Neuron Explorer. If you are currently using the Neuron Profiler, migrate to Neuron Explorer before the Neuron 2.29 release. For migration guidance, see the :doc:`/tools/neuron-explorer/migration-faq`. What is Neuron Explorer? ~~~~~~~~~~~~~~~~~~~~~~~~ Neuron Explorer is the next-generation suite of tools, guiding developers through their development journey on Trainium. It enables ML performance engineers to: * **Trace execution end-to-end** — from source code down to hardware operations. * **Analyze model behavior at every layer of the stack** — with detailed breakdowns per operation, per core, and per device. * **Profile distributed workloads** — with native support for multi-node and multi-worker analysis at scale. For more details, see :doc:`/tools/neuron-explorer/index`. How does this impact current Neuron Profiler users? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. important:: Neuron strongly recommends migrating to Neuron Explorer **before** the Neuron 2.29 release. There are two things to be aware of when migrating: * **Existing NTFF profile files are supported**, but must be reprocessed before they can be viewed in the Neuron Explorer UI. * **New features require new profiles.** To access the full set of Neuron Explorer capabilities, you must recapture your profiles using the updated tooling. For detailed migration steps, see the :doc:`/tools/neuron-explorer/migration-faq` and the :ref:`Neuron Explorer FAQ `. What happens to Neuron Profiler after Neuron 2.29? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After Neuron 2.29, Neuron Profiler will: * **No longer receive** bug fixes, feature updates, or technical support. * **No longer be distributed** as part of the Neuron SDK. If you need to continue using Neuron Profiler temporarily, you must pin your environment to Neuron 2.28 or earlier. This is **not recommended**, as you will not receive any SDK updates or security fixes. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler-v230.rst ================================================ .. post:: March 31, 2026 :language: en :tags: announce-eos-neuron-profiler .. _announce-eos-neuron-profiler-v230: Neuron Explorer Replaces Neuron Profiler, Starting with Neuron 2.30.0 ---------------------------------------------------------------------- Starting with Neuron 2.30.0, Neuron Profiler and Profiler 2.0 (UI and CLI) will reach end of support and be replaced by Neuron Explorer. If you are currently using the Neuron Profiler, migrate to Neuron Explorer before the Neuron 2.30.0 release. For migration guidance, see the :doc:`/tools/neuron-explorer/migration-faq`. What is Neuron Explorer? 
~~~~~~~~~~~~~~~~~~~~~~~~ Neuron Explorer is the next-generation suite of tools, guiding developers through their development journey on Trainium. It enables ML performance engineers to: * **Trace execution end-to-end** — from source code down to hardware operations. * **Analyze model behavior at every layer of the stack** — with detailed breakdowns per operation, per core, and per device. * **Profile distributed workloads** — with native support for multi-node and multi-worker analysis at scale. For more details, see :doc:`/tools/neuron-explorer/index`. How does this impact current Neuron Profiler users? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. important:: Neuron strongly recommends migrating to Neuron Explorer **before** the Neuron 2.30.0 release. There are two things to be aware of when migrating: * **Existing NTFF profile files are supported**, but must be reprocessed before they can be viewed in the Neuron Explorer UI. * **New features require new profiles.** To access the full set of Neuron Explorer capabilities, you must recapture your profiles using the updated tooling. For detailed migration steps, see the :doc:`/tools/neuron-explorer/migration-faq` and the :ref:`Neuron Explorer FAQ `. What happens to Neuron Profiler after Neuron 2.30.0? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After Neuron 2.30.0, Neuron Profiler will: * **No longer receive** bug fixes, feature updates, or technical support. * **No longer be distributed** as part of the Neuron SDK. If you need to continue using Neuron Profiler temporarily, you must pin your environment to Neuron 2.28 or earlier. This is **not recommended**, as you will not receive any SDK updates or security fixes. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announce-eos-neuron-profiler .. _announce-eos-neuron-profiler: End of Support for Neuron Profiler and Neuron Profiler 2.0 UI and CLI coming in a future Neuron release -------------------------------------------------------------------------------------------------------- What's changing ^^^^^^^^^^^^^^^^ Neuron will end support for the legacy Neuron Profiler and Neuron Profiler 2.0 UI and CLI tools in a coming release (planned for v2.29.0). We launched Neuron Explorer in Neuron SDK 2.27, replacing these tools with a unified developer experience that will include device and system profiling in a single view, eager mode support, enhanced memory profiling, improved visualization capabilities, as well as support for the full developer lifecycle. Why are we making this change ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Consolidating to Neuron Explorer allows us to focus development efforts on a single, modern profiling solution while providing you with enhanced features and a better user experience. How does this impact you ^^^^^^^^^^^^^^^^^^^^^^^^^ If you are currently using the legacy Neuron Profiler UI or CLI, please do the following before Neuron 2.29: * Begin using Neuron Explorer (available since Neuron 2.27). See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/get-started.html# * Reprocess your existing NTFF files for the new UI: see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/how-to-profile-workload.html Note: Neuron Explorer is backwards compatible with existing Profiler NTFF files, but they must be reprocessed to view in the new UI. 
For new features (eager mode, memory viewer, certain NKI tools), you'll need to recapture profiles.

After Neuron 2.29.0 releases (planned):

* The legacy UI will no longer receive bug fixes, updates, or technical support
* To continue using the legacy UI, you must pin to the last version that supports it (not recommended)
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-neurondevice-version.rst
================================================
.. post:: June 28, 2024
   :language: en
   :tags: announce-eos-neuron-device-version, neuron-device-version

.. _announce-eos-neuron-device-version:

Announcing end of support for 'neuron-device-version' field in neuron-monitor
-------------------------------------------------------------------------------

:ref:`Neuron release 2.19 ` will be the last release to include the field 'neuron-device-version' in neuron-monitor. In future releases, customers who are using the field 'neuron-device-version' will instead need to use the 'instance_type' field in the 'instance_info' section and the 'neuroncore_version' field to obtain Neuron device information. Please see :ref:`neuron-monitor-ug` for more details.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-neurondevice.rst
================================================
.. post:: June 28, 2024
   :language: en
   :tags: announce-eos-neuron-device, neuron-device

.. _announce-eos-neurondevice:

Announcing end of support for 'neurondevice' resource name in Neuron Device K8s plugin
----------------------------------------------------------------------------------------

:ref:`Neuron release 2.19 ` will be the last release to include the resource name 'neurondevice'. The Neuron device plugin is a Neuron software component that gets installed in Kubernetes environments. The resource name 'neurondevice' enables customers to allocate devices to the Neuron K8s container. In future releases, we will rename the resource name 'neurondevice' to 'neuron' to maintain consistency. Customers who are using the resource name 'neurondevice' in their YAML file will need to update it to use 'neuron'. Please see :ref:`k8s-neuron-device-plugin` for more details.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-nxd-examples.rst
================================================
.. post:: December 20, 2024
   :language: en
   :tags: announce-eos-nxd-examples

.. _announce-eos-nxd-examples:

Announcing migration of NxD Core examples from the NxD Core repository to the NxD Inference repository in the next release
---------------------------------------------------------------------------------------------------------------------------

:ref:`Neuron Release 2.21 ` will be the last release to include NxD Core inference examples under the NxD Core repository: https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/inference. Starting with :ref:`Neuron Release 2.21 `, the models and modules in the NxD Core inference examples are available through the NxD Inference package. We recommend that customers update their applications to use examples from the NxD Inference repository; see :ref:`nxdi-overview`. In Neuron Release 2.22, the NxD Core inference samples will only reside under the NxD Inference repository. Current users are advised to start using samples/tutorials under the NxD Inference repository: https://github.com/aws-neuron/neuronx-distributed-inference.

I currently utilize an inference sample from the NxD Core repository in my model code. What do I do?
======================================================================================================

If your applications depend on the inference examples from NxD Core, we recommend that you update your code to use the new NxD Inference package. With NxD Inference, you can import and use these models and modules in your applications. Any models compiled with inference code from the NxD Core repository will need to be re-compiled. Please refer to the :ref:`nxd-examples-migration-guide` for guidance.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-nxdt-nxd-core-training.rst
================================================
.. post:: February 26, 2026
   :language: en
   :tags: announce-eos-nxdt

.. _announce-eos-nxdt-nxd-core-training:

Announcing end of support for NxDT and NxD Core Training APIs starting with Neuron SDK release 2.29 (PyTorch 2.10)
-------------------------------------------------------------------------------------------------------------------

Neuron SDK release 2.28 (PyTorch 2.9) will be the last release to include the NeuronX Distributed Training (NxDT) library. Starting with Neuron SDK release 2.29 (PyTorch 2.10), the use of NxD Core training APIs and the PyTorch/XLA package for training will no longer be supported.

How does this impact you?
~~~~~~~~~~~~~~~~~~~~~~~~~~

Existing NxDT/NxD Core users should stay on Neuron SDK 2.28 (PyTorch 2.9) until they are ready to migrate to native PyTorch on Neuron. Native PyTorch on Neuron uses standard distributed primitives (DTensor, FSDP, DDP). A migration guide will be published in a coming release. See :doc:`Native PyTorch on Neuron Overview ` for more information.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-probuf.rst
================================================
.. post:: June 28, 2024
   :language: en
   :tags: announce-eos-probuf, probuf

.. _announce-eos-probuf319:

Announcing end of support for Protobuf versions <= 3.19 for PyTorch NeuronX, NeuronX Distributed, and Transformers NeuronX libraries
------------------------------------------------------------------------------------------------------------------------------------

:ref:`Neuron release 2.19 ` will be the last release to include Protobuf <= 3.19 support for the PyTorch NeuronX, NeuronX Distributed, and Transformers NeuronX libraries. Future Neuron releases will not include Protobuf <= 3.19 support for PyTorch NeuronX. Current PyTorch NeuronX, NeuronX Distributed, or Transformers NeuronX users using Protobuf <= 3.19 are advised to migrate to the latest supported Protobuf version.
================================================
FILE: about-neuron/announcements/neuron2.x/announce-eos-pt-versions.rst
================================================
.. post:: December 20, 2023
   :language: en
   :tags: announce-eos-pt, pt-versions

.. _announce-eos_pytorch110:

Announcing End of Support for PyTorch Neuron version 1.10
-----------------------------------------------------------

:ref:`Neuron release 2.16 ` will be the last release to include support for PyTorch Neuron version 1.10. Future Neuron releases will not include support for PyTorch Neuron version 1.10. Current users of PyTorch Neuron version 1.10 are advised to migrate to the latest supported PyTorch Neuron version.
================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pt2.rst ================================================ .. post:: December 20, 2023 :language: en :tags: announce-eos-pt-two, pt-versions-two .. _announce-eos_pytorch2: Announcing End of Support for PyTorch NeuronX version 2.0 (beta) ----------------------------------------------------------------- :ref:`Neuron release 2.16 ` will be the last release that will include support for PyTorch NeuronX version 2.0 (beta). Future Neuron releases will not include support for PyTorch NeuronX version 2.0. Current users of PyTorch NeuronX version 2.0 are advised to upgrade to PyTorch NeuronX 2.1 (beta). ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-python38.rst ================================================ .. post:: December 20, 2024 :language: en :tags: announce-python-eos .. _announce-python-eos: Announcing end of support for Python 3.8 in future releases ----------------------------------------------------------- Due to Python 3.8 reaching its end-of-life status, future Neuron releases will no longer include support for this version. ========================= How does this impact me? ========================= I currently use Python 3.8. ============================ To avoid security issues and bugs, current users of Python 3.8 are advised to migrate to a Neuron supported Python version (3.9, 3.10, or 3.11), as Neuron will no longer support Python 3.8. For a list of supported Python versions by Neuron package, please see :ref:`latest-neuron-release-artifacts`. I currently use Ubuntu 20, which has Python 3.8 as the default version. Am I affected? ======================================================================================= Although Python 3.8 is the default version of Ubuntu 20.04, Neuron will continue to support Ubuntu 20.04 until April 2025, due to extended standard support of Python 3.8 in Ubuntu 20. Please see the :ref:`sdk-maintenance-policy` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-1-1-3.rst ================================================ .. post:: December 20, 2024 :language: en :tags: announce-eos-pytorch-version .. _announce-eos-pytorch-eos-113: Announcing end of support for PyTorch 1.13 starting next release ---------------------------------------------------------------- :ref:`Neuron Release 2.21 ` is the last release to support PyTorch 1.13, its associated Deep Learning Containers (DLCs), and Deep Learning AMIs (DLAMIs) for Trn1, Trn2, and Inf2 instances. We recommend that all customers using PyTorch 1.13, related DLCs, and DLAMIs on Trn2, Trn1, and Inf2 instances upgrade to the latest supported PyTorch version. For more information on supported versions, please refer to :ref:`latest-neuron-release-artifacts`. Please note that PyTorch 1.13 will continue to be supported for Inf1 instances. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-1-9.rst ================================================ .. post:: August 28, 2023 :language: en :tags: announce-eol, torch-neuron .. _announce-eol-pytorch19: Announcing end of support for ``torch-neuron`` version 1.9 ----------------------------------------------------------- :ref:`Neuron release 2.13 ` will be the last release that will include support for ``torch-neuron`` version 1.9.
Future Neuron releases will not include support for ``torch-neuron`` version 1.9. Current users of ``torch-neuron`` version 1.9 are advised to migrate to the latest supported ``torch-neuron`` version. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-1.rst ================================================ .. post:: December 20, 2024 :language: en :tags: announce-eos-pytorch-version .. _announce-eos-pytorch-2-1: Announcing end of support for PyTorch 2.1 starting next release --------------------------------------------------------------- :ref:`Neuron Release 2.21 ` is the last release to support PyTorch 2.1, its associated Deep Learning Containers (DLCs), and Deep Learning AMIs (DLAMIs). We recommend that all customers using PyTorch 2.1, related DLCs, and DLAMIs upgrade to the latest supported PyTorch version. For more information on supported versions, please refer to :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-7-2-8-v229.rst ================================================ .. post:: March 31, 2026 :language: en :tags: announce-eos-pytorch-version .. _announce-eos-pytorch-2-7-2-8-v229: Neuron no longer supports PyTorch versions 2.7 and 2.8 starting with Neuron 2.29 ---------------------------------------------------------------------------------- Starting with Neuron 2.29, Neuron no longer supports PyTorch versions 2.7 and 2.8. We recommend that all customers upgrade to the latest supported PyTorch version. Customers currently using PyTorch versions 2.7 and 2.8 must upgrade to a newer supported PyTorch version. For more information on supported versions, refer to :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-7-2-8.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-eos-pytorch-version .. _announce-eos-pytorch-2-7-2-8: Announcing end of support for PyTorch versions 2.7 and 2.8 starting next release --------------------------------------------------------------------------------- :ref:`Neuron Release 2.28 ` is the last release to support PyTorch versions 2.7 and 2.8. Future Neuron releases will not include support for PyTorch versions 2.7 and 2.8. Current users of PyTorch version 2.7 or 2.8 are advised to upgrade to PyTorch 2.9. For more information on supported versions, refer to :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-pytorch-profiling-api.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announce-eos-pytorch-profling-api .. _announce-eos-pytorch-profling-api: End of Support for PyTorch Experimental Profiling API starting in a future release ------------------------------------------------------------------------------------ What's changing ^^^^^^^^^^^^^^^^ Neuron will end support for the ``torch_neuronx.experimental.profiler.profile`` API in a future release of Neuron (planned for v2.29.0). This experimental API will be replaced by native PyTorch profiling support using the standard ``torch.profiler.profile()`` API.
How does this impact you ^^^^^^^^^^^^^^^^^^^^^^^^^ If you are using ``torch_neuronx.experimental.profiler.profile``, before April/May 2026: * Update your code to use the native PyTorch profiling API:

.. code-block:: python

   # Before (Experimental API)
   from torch_neuronx.experimental import profiler

   with profiler.profile(output_path="/tmp/profile") as prof:
       output = model(input)

   # After (Native API)
   import torch.profiler

   with torch.profiler.profile(
       activities=[torch.profiler.ProfilerActivity.NEURON],
       on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/profile")
   ) as prof:
       output = model(input)

After Neuron 2.29.0 releases (planned): * The experimental API will no longer be supported * To continue using the experimental API, you must pin to Neuron SDK 2.28 or earlier (not recommended) ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-tensorboard-tools.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announce-eos-tensorboard-tools .. _announce-eos-tensorboard-tools: Announcing End of Support for TensorBoard Plugin for Neuron Profiler in Neuron 2.27 ----------------------------------------------------------------------------------- Neuron 2.27 will be the last release to support the TensorBoard plugin. Future Neuron releases will not include support for the TensorBoard plugin. All customers using the TensorBoard plugin to visualize and analyze model performance are recommended to migrate to Neuron Explorer. To begin using Neuron Explorer (available since Neuron 2.27) for profiling, see :doc:`the Neuron Explorer documentation
`. Neuron Explorer was introduced with :doc:`the release of the AWS Neuron SDK version 2.27.0 `. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-tensorflow-2-8-9.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-tensorflow-versions-eos .. _announce-tfx-2-8-9-eos: Announcing end of support for TensorFlow 2.8 and 2.9 starting next release ---------------------------------------------------------------------------- Starting with Neuron Release 2.23, Neuron will end support for TensorFlow 2.8 and 2.9. Future Neuron releases will not include support for TensorFlow-Neuron 2.8 and 2.9 versions. Current users of those versions are advised to migrate to the latest TensorFlow version (2.10). For a list of supported versions, please see :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-tensorflow-inf2.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-eos-tensorflow .. _announce-eos-tensorflow-inf2: Announcing end of support for TensorFlow for Inferentia2 (Inf2) starting with Neuron 2.29 ------------------------------------------------------------------------------------------ :ref:`Neuron Release 2.28 ` is the last release to support TensorFlow for Inferentia2 (``Inf2``). Future Neuron releases will not include TensorFlow support for ``Inf2`` instances. Current Inf2 instance users are advised to use the latest PyTorch version (2.9). For a list of supported PyTorch versions, see :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-tensorflow1-x.rst ================================================ .. post:: June 28, 2024 :language: en :tags: announce-tensorflow-eos, tf-versions-1-x .. _announce-tfx-eos: Announcing end of support for TensorFlow-Neuron 1.x ----------------------------------------------------- :ref:`Neuron release 2.19 ` will be the last release to support TensorFlow-Neuron 1.x. Future Neuron releases will not include support for TensorFlow-Neuron 1.x versions. Current users of those versions are advised to migrate to the latest tensorflow-neuron version, 2.10.1. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-torch-neuron.rst ================================================ .. post:: September 16, 2024 :language: en :tags: announce-torch-neuron-eos, torch-neuron .. _announce-torch-neuron-eos: Announcing maintenance mode for torch-neuron 1.9 and 1.10 versions --------------------------------------------------------------------- Starting with :ref:`Neuron release 2.20 `, torch-neuron 1.9 and 1.10 versions will enter maintenance mode. Future Neuron releases will not include support for torch-neuron 1.9 and 1.10 versions. Current users of torch-neuron 1.9 and 1.10 versions are advised to migrate to torch-neuron 1.13. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-torch-neuronx-nki-jit.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-eos-torch-neuronx-nki-jit ..
_announce-eos-torch-neuronx-nki-jit: Announcing end of support for ``torch_neuronx.nki_jit`` API in ``torch-neuronx`` starting next release --------------------------------------------------------------------------------------------------------- :ref:`Neuron Release 2.23 ` will be the last release to include support for the ``torch_neuronx.nki_jit`` API in the ``torch-neuronx`` package. Customers using the ``torch_neuronx.nki_jit`` API are recommended to switch invocations to directly call functions annotated with ``@nki.jit``. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-u20-dlamis.rst ================================================ .. post:: December 20, 2024 :language: en :tags: announce-u20-dlami-dlc-eos .. _announce-u20-dlami-dlc-eos: Announcing end of support for Ubuntu20 DLCs and DLAMIs ------------------------------------------------------ Starting with :ref:`Neuron Release 2.21 `, AWS Neuron will begin phasing out support for Ubuntu20 Deep Learning Containers (DLCs) and Deep Learning AMIs (DLAMIs). Neuron 2.21 will be the last release to provide bug fixes, and by Neuron 2.22, these offerings will no longer be available. We recommend that all customers using Ubuntu20 DLCs and DLAMIs migrate to newer versions based on Ubuntu22 or Amazon Linux 2023. Customers who need to continue using Ubuntu20 can create custom AMIs based on the Ubuntu20 base image and install Neuron components manually. Please see :ref:`container-faq` and :ref:`neuron-dlami-overview`. Please note that this does not affect support for the base Ubuntu20 operating system, which will continue to receive updates as per our standard support policy. For more information, please see :ref:`sdk-maintenance-policy`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-eos-xla-bf16.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-eos-xla-bf .. _announce-eos-xla-bf: Announcing end of support for XLA_USE_BF16 and XLA_DOWNCAST_BF16 starting next release ---------------------------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.23 `, Neuron will begin phasing out support for the ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` environment variables. In this release, usage of these variables will trigger warnings. Neuron will end support in a subsequent release, aligned with the torch-xla maintenance schedule. Customers are recommended to migrate to automatic mixed precision or use ``model.to(torch.bfloat16)`` to convert their model to BF16 format. For detailed migration guidance, please refer to :ref:`migration_from_xla_downcast_bf16`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eol-nemo-arg.rst ================================================ .. post:: Oct 26, 2023 :language: en :tags: announce-intent-end-of-support-nemo-arg, nemo-arg .. _announce-intent-deprecate-nemo-arg: Announcing End of Support for ``nemo`` option-argument ------------------------------------------------------- :ref:`Neuron release 2.15 ` will be the last release that will include support for the ``nemo`` option-argument of the existing ``--distribution_strategy`` :ref:`compiler option `. Future releases will not include Neuron support for the ``nemo`` option-argument. Users are advised to migrate to the new ``llm-training`` option-argument.
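For reference, here is a minimal sketch of adopting the replacement option-argument from a training script by way of the ``NEURON_CC_FLAGS`` environment variable, which is one common way to pass options through to the compiler (adapt this to however your workflow sets compiler flags):

.. code-block:: python

   import os

   # Append the new option-argument to any compiler flags already set,
   # mirroring the compiler option referenced above.
   existing = os.environ.get("NEURON_CC_FLAGS", "")
   os.environ["NEURON_CC_FLAGS"] = f"{existing} --distribution_strategy=llm-training".strip()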
================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eos-opt.rst ================================================ .. post:: Oct 26, 2023 :language: en :tags: announce-intent-eos-opt, opt .. _announce-intent-eos-opt: Announcing End Of Support for the OPT example in Transformers NeuronX ------------------------------------------------------------------ :ref:`Neuron release 2.15 ` will be the last release that will include the OPT example in Transformers NeuronX. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eos-pt-version.rst ================================================ .. post:: June 24, 2025 :language: en :tags: announce-eos-pt-two-five .. _announce-eos_pytorch25: Announcing End of Support for PyTorch NeuronX version 2.5 starting next release --------------------------------------------------------------------------------- :ref:`Neuron release 2.24 ` will be the last release that will include support for PyTorch NeuronX version 2.5. Future Neuron releases will not include support for PyTorch NeuronX version 2.5. Current users of PyTorch NeuronX version 2.5 are advised to upgrade to PyTorch NeuronX 2.6 or 2.7. Please see the release artifacts for more details on supported versions. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eos-pt2-6.rst ================================================ .. post:: September 18, 2025 :language: en :tags: announce-eos-pt2-6 .. _announce-eos_pt2-6: Announcing End of Support for PyTorch NeuronX version 2.6 starting next release --------------------------------------------------------------------------------- :ref:`Neuron release 2.26 ` will be the last release that will include support for PyTorch NeuronX version 2.6. Future Neuron releases will not include support for PyTorch NeuronX version 2.6. Current users of PyTorch NeuronX version 2.6 are advised to upgrade to PyTorch NeuronX 2.7 or 2.8. See :ref:`Neuron release artifacts ` for more details on supported versions. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eos-tensorflow-tutorial-inf.rst ================================================ .. post:: June 24, 2025 :language: en :tags: announce-eos-tensorflow-tutorial .. _announce-eos-tensorflow-tutorial: Announcing End of Support for the TensorFlow Neuron Inf1 SSD300 tutorial starting next release -------------------------------------------------------------------------------------------- :ref:`Neuron release 2.24 ` will be the last release that will include support for the :ref:`TensorFlow Neuron Inf1 SSD300 ` tutorial. Future Neuron releases will not include support for the :ref:`TensorFlow Neuron Inf1 SSD300 ` tutorial due to security issues. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-eos-tnx.rst ================================================ .. post:: June 24, 2025 :language: en :tags: announce-eos-tnx .. _announce-eos-tnx: Announcing end of support for Transformers NeuronX library starting in Neuron 2.26 release -------------------------------------------------------------------------------------------- Starting from :ref:`Neuron Release 2.24 `, the Transformers NeuronX library is in maintenance mode. ``transformers-neuronx`` releases will now only address critical security issues. In Neuron Release 2.26, Neuron will end support for ``transformers-neuronx``.
Current users of ``transformers-neuronx`` are advised to migrate to :ref:`NeuronX Distributed Inference `. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-intent-maintenance-tnx.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-transformers-neuronx-maintenance, tnx .. _announce-tnx-maintenance: Announcing maintenance mode for Transformers NeuronX library starting next release ------------------------------------------------------------------------------------ Starting from Neuron release 2.24, the Transformers NeuronX library is entering maintenance mode. Future releases of ``transformers-neuronx`` will address critical security issues only, and we will gradually end support. Current users of ``transformers-neuronx`` are advised to migrate to :ref:`NeuronX Distributed Inference `. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-maintenance-mxnet.rst ================================================ .. post:: June 28, 2024 :language: en :tags: announce-mxnet-maintenance, mxnet .. _announce-mxnet-maintenance: Neuron support for MXNet enters maintenance mode --------------------------------------------------- Starting with :ref:`Neuron release 2.19 `, Neuron support for MXNet (``mxnet-neuron``) is entering maintenance mode. Future releases of ``mxnet-neuron`` will address critical security issues only, and we will gradually end support. Current users of ``mxnet-neuron`` are advised to migrate to PyTorch NeuronX or TensorFlow NeuronX. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-maintenance-nxdi-nxd-core-inference.rst ================================================ .. post:: March 31, 2026 :language: en :tags: announce-maintenance-nxdi .. _announce-maintenance-nxdi-nxd-core-inference: Announcing maintenance mode for NxD Inference and NxD Core Inference APIs starting next release ----------------------------------------------------------------------------------------------- Starting with Neuron 2.30.0, the NxD Inference library and NxD Core Inference APIs are entering maintenance mode. Future releases will address critical security issues only, and we will gradually end support. We are actively investing in an enhanced vLLM Neuron plugin that will not require a separate NxD Inference library. More information about the vLLM Neuron plugin enhancements and migration guidance will be shared in the upcoming release. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-maintenance-nxdt-nxd-core-training.rst ================================================ .. post:: March 31, 2026 :language: en :tags: announce-maintenance-nxdt .. _announce-maintenance-nxdt-nxd-core-training: Announcing maintenance mode for NxDT and NxD Core Training APIs starting next release ------------------------------------------------------------------------------------- Starting with Neuron 2.30.0, NxDT and NxD Core Training APIs are entering maintenance mode. Future releases will address critical security issues only, and we will gradually end support. How does this impact you? ~~~~~~~~~~~~~~~~~~~~~~~~~ Existing NxDT/NxD Core users should stay on Neuron 2.28 and PyTorch 2.9 until ready to migrate to native PyTorch on Neuron (starting PyTorch 2.10).
Customers are recommended to use native PyTorch with standard distributed primitives (DTensor, FSDP, DDP) and TorchTitan starting with Neuron 2.30.0 and PyTorch 2.10. A migration guide will be published in a coming release. See :doc:`/frameworks/torch/pytorch-native-overview` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-maintenance-tf.rst ================================================ .. post:: April 1, 2024 :language: en :tags: announce-tensorflow-maintenance, tf-versions .. _announce-tfx-maintenance: TensorFlow-Neuron 1.x enters maintenance mode ----------------------------------------------- Starting with :ref:`Neuron release 2.18 `, TensorFlow-Neuron 1.x is entering maintenance mode. Future releases of TensorFlow-Neuron 1.x will address critical security issues only, and we will gradually end support. Current users of those versions are advised to migrate to the latest tensorflow-neuron version, 2.10.1. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-moving-samples.rst ================================================ .. post:: December 20, 2023 :language: en :tags: announce-moving-nxd-samples, nxd-samples .. _announce-moving-samples: Announcing end-of-support for NeuronX Distributed Training Samples in Neuron Samples Repository ------------------------------------------------------------------------------------------------ :ref:`Neuron release 2.16 ` will be the last release to include support for NeuronX Distributed Training Samples (Llama-2, GPT-NeoX 20B, and GPT-NeoX 6.9B) under the `AWS Neuron Samples GitHub repository `_. In future releases, NeuronX Distributed samples will reside under the `NeuronX Distributed GitHub repository `_. Current users are advised to start using samples under the NeuronX Distributed repository for all NeuronX Distributed tutorials. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-nki-library-namespace-changes-2-28.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-nki-library-changes .. _announce-nki-library-namespace-changes-2-28: NKI Library namespace changes starting with Neuron 2.28 -------------------------------------------------------- Starting with Neuron 2.28, the open source repository namespace has changed from ``nkilib_standalone.nkilib.*`` to ``nkilib.*``, providing a consistent namespace between the open source repository and the shipped version. If customers want to add or modify NKI Library kernels, they can build and install them to replace the default implementation without changing model imports. See :ref:`NKI Library ` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-nki-namespace-migration.rst ================================================ .. post:: March 31, 2026 :language: en :tags: announce-nki-namespace .. _announce-nki-namespace-migration: Announcing NKI Library Kernel Migration to New nki.* Namespace starting Neuron 2.29 ------------------------------------------------------------------------------------ Starting with Neuron 2.29, all NKI Library kernels have been migrated to the new ``nki.*`` namespace. The new ``nki.*`` namespace introduces changes to NKI APIs and language constructs that improve usability and performance.
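As an illustration, a kernel's imports might change along the following lines; the new module spellings are assumptions inferred from the namespace rename, so treat this as a sketch and consult the migration guide for the authoritative mapping:

.. code-block:: python

   # Before: legacy namespace shipped inside the compiler package
   # import neuronxcc.nki as nki
   # import neuronxcc.nki.language as nl

   # After: assumed equivalents under the new top-level namespace
   import nki                 # assumption: the @nki.jit decorator moves with the namespace
   import nki.language as nl  # assumption: language constructs keep their module name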
This transition ensures consistency across all NKI kernels and allows us to focus development efforts on a single, modern namespace. See the :doc:`/nki/deep-dives/nki-migration-guide` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-neuron-det.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-no-longer-support-neuron-det .. _announce-no-longer-support-neuron-det: Neuron no longer includes support for Neuron DET tool starting with this release --------------------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.22 `, Neuron no longer supports the Neuron Distributed Event Tracing (NDET/neuron-det) tool. We recommend that customers transition to Neuron Profiler 2.0 for debugging runtime hangs and issues in large-scale settings. This tool offers the same runtime function-level traces with improved ease of use and optimized performance. For more information about Neuron Profiler 2.0, see :ref:`neuron-profiler-2-0-guide`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-nxd-examples.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-eol-nxd-examples .. _announce-eol-nxd-examples: Announcing migration of NxD Core inference examples from NxD Core repository to NxD Inference repository starting this release ================================================================================================================================== Starting with :ref:`Neuron Release 2.23 `, the following models and modules in NxD Core inference examples are now only available through the NxD Inference package: - Llama - Mixtral - DBRX I currently use one of the mentioned inference samples from the NxD Core repository in my model code. What do I do? ------------------------------------------------------------------------------------------------------------------------ For customers who want to deploy models out of the box, please use the NxD Inference model hub, which is the recommended option. With NxD Inference, you can import and use these models and modules in your applications. Customers will need to update their applications to use examples under the NxD Inference repository: https://github.com/aws-neuron/neuronx-distributed-inference. Any models compiled with inference code from the NxD Core repository will need to be re-compiled. Please refer to the :ref:`nxd-examples-migration-guide` for guidance and see :ref:`nxdi-overview` for more information. I would like to continue using NxD Core. What do I do? -------------------------------------------------------- For customers who want to continue using NxD Core without NxD Inference, please refer to the Llama 3.2 1B sample as a reference implementation: https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/inference/llama ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-113.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-no-longer-support-pytorch-version ..
_announce-no-longer-support-pytorch-113: Neuron no longer supports PyTorch 1.13 starting this release ------------------------------------------------------------- Starting with :ref:`Neuron Release 2.22 `, Neuron no longer supports PyTorch 1.13, its associated Deep Learning Containers (DLCs), and Deep Learning AMIs (DLAMIs) for Trn1, Trn2, and Inf2 instances. We recommend that all customers using PyTorch 1.13, related DLCs, and DLAMIs on Trn2, Trn1, and Inf2 instances upgrade to the latest supported PyTorch version. For more information on supported versions, please refer to :ref:`latest-neuron-release-artifacts`. Please note that PyTorch 1.13 will continue to be supported for Inf1 instances. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-2-1.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-no-longer-support-pytorch-version .. _announce-no-longer-support-pytorch-2-1: Neuron no longer supports PyTorch 2.1 starting this release ------------------------------------------------------------ Starting with :ref:`Neuron Release 2.22 `, Neuron no longer includes support for PyTorch 2.1, its associated Deep Learning Containers (DLCs), and Deep Learning AMIs (DLAMIs). We recommend that all customers using PyTorch 2.1, related DLCs, and DLAMIs upgrade to the latest supported PyTorch version. For more information on supported versions, please refer to :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-2-7-2-8.rst ================================================ .. post:: March 30, 2026 :language: en :tags: announce-no-longer-support-pytorch-version .. _announce-no-longer-support-pytorch-2-7-2-8: Neuron no longer supports PyTorch versions 2.7 and 2.8 starting with Neuron 2.29 ---------------------------------------------------------------------------------- Starting with Neuron 2.29, Neuron no longer supports PyTorch versions 2.7 and 2.8. We recommend that all customers upgrade to the latest supported PyTorch version. Customers currently using PyTorch versions 2.7 and 2.8 must upgrade to a newer supported PyTorch version. For more information on supported versions, refer to :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-tensorflow-inf2.rst ================================================ .. post:: March 30, 2026 :language: en :tags: announce-no-longer-support-tensorflow .. _announce-no-longer-support-tensorflow-inf2: Neuron no longer supports TensorFlow for Inferentia2 (Inf2) starting with Neuron 2.29 --------------------------------------------------------------------------------------- Starting with Neuron 2.29, Neuron no longer supports TensorFlow for Inferentia2 (Inf2). Current Inf2 instance users are advised to use the latest PyTorch version (2.9). For a list of supported PyTorch versions, see :doc:`/release-notes/releasecontent`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-longer-support-u20-dlc-dlami.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-u20-dlami-dlc-no-longer-support ..
_announce-u20-dlami-dlc-no-longer-support: Neuron no longer includes support for Ubuntu20 DLCs and DLAMIs starting this release ------------------------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.22 `, Neuron no longer includes offerings for Ubuntu20 Deep Learning Containers (DLCs) and Deep Learning AMIs (DLAMIs). Customers using Ubuntu20 DLCs and DLAMIs should migrate to newer versions based on Ubuntu22 or Amazon Linux 2023. Customers who need to continue using Ubuntu20 can create custom AMIs based on the Ubuntu20 base image and install Neuron components manually. Please see :ref:`container-faq` and :ref:`neuron-dlami-overview`. Please note that this does not affect support for the base Ubuntu20 operating system, which will continue to receive updates as per our standard support policy. For more information, please see :ref:`sdk-maintenance-policy`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-al2.rst ================================================ .. post:: September 16, 2024 :language: en :tags: end-support-al2 .. _eos-al2: Neuron Runtime no longer supports Amazon Linux 2 (AL2) ======================================================== Starting from :ref:`Neuron release 2.20 `, the Neuron Runtime (``aws-neuronx-runtime-lib``) will no longer support Amazon Linux 2 (AL2). The Neuron Driver (``aws-neuronx-dkms``) is now the only Neuron package that supports Amazon Linux 2. However, the Neuron Driver requires Linux kernel 5.10 or higher. Since default AL2 AMIs ship with kernel 4.14, you must upgrade your AL2 kernel to 5.10+ before installing driver versions 2.18 and later, or migrate to Amazon Linux 2023 or Ubuntu, which include compatible kernels by default. This change introduces the following constraint: customers cannot run their full Neuron-powered applications natively on an AL2-based Amazon Machine Image (AMI). To leverage Neuron functionality on an AL2 AMI, customers must containerize their applications using a Neuron supported container with a non-AL2 Linux distribution (e.g., Ubuntu 22.04, Amazon Linux 2023, etc.) and then deploy those containers on an AL2-based AMI that has the Neuron Driver (``aws-neuronx-dkms``) installed. How does this impact me? ------------------------ **I have an AL2 DLAMI** If you are using one of the following Amazon Linux 2 DLAMIs, please migrate to a supported DLAMI (e.g., Ubuntu 22.04, Amazon Linux 2023 (AL2023), etc.). Please see :ref:`neuron-dlami-overview` for a list of all supported DLAMIs to migrate to.

+-----------------+------------------+------------------------------------------------------------+
| Framework       | Operating System | DLAMI Name                                                 |
+=================+==================+============================================================+
| PyTorch 1.13    | Amazon Linux 2   | Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)     |
+-----------------+------------------+------------------------------------------------------------+
| TensorFlow 2.10 | Amazon Linux 2   | Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2)  |
+-----------------+------------------+------------------------------------------------------------+

**I am using my own AL2 Container** If you are using your own AL2 container, please migrate to a Neuron supported container with a non-AL2 Linux distribution (e.g., Ubuntu 22.04, Amazon Linux 2023, etc.)
**I am using a base AL2 DLAMI** If you are using a base Amazon Linux 2 DLAMI, please ensure the Neuron Driver (``aws-neuronx-dkms``) is the only Neuron package installed. Please use non-AL2 containers (e.g., Ubuntu 22.04, Amazon Linux 2023, etc.) to run your Neuron applications. .. note:: Neuron does not support Linux kernel versions < 5.10. Customers using Linux kernel versions < 5.10 must migrate to >= 5.10. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-device-version.rst ================================================ .. post:: September 16, 2024 :language: en :tags: eos-neuron-device, neuron-device- .. _eos-neurondevice: 'neurondevice' resource name in Neuron Device K8s plugin no longer supported ------------------------------------------------------------------------------ Starting with :ref:`Neuron release 2.20 `, Neuron no longer supports the resource name 'neurondevice'. The Neuron device plugin is a Neuron software component that is installed in Kubernetes environments. The resource name 'neurondevice' enables customers to allocate devices to the Neuron K8s container. In this release, we renamed the resource name 'neurondevice' to 'neuron' to maintain consistency. Customers who are using the resource name 'neurondevice' in their YAML files need to update it to 'neuron'. Please see :ref:`k8s-neuron-device-plugin` for more details. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-jax-neuronx-nki-call.rst ================================================ .. post:: May 15, 2025 :language: en :tags: .. _announce-eos-jax-neuronx-features: Neuron no longer supports ``jax_neuronx.nki_call`` API in ``jax-neuronx`` starting this release ------------------------------------------------------------------------------------------------- :ref:`Neuron Release 2.23 ` no longer supports the ``jax_neuronx.nki_call`` API in the ``jax-neuronx`` package. For a full list of features that require ``jax-neuronx``, please see :ref:`jax-neuron-known-issues`. Customers using the ``jax_neuronx.nki_call`` API will need to switch invocations to directly call functions annotated with ``@nki.jit``. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-llama3-2-checkpoint.rst ================================================ .. post:: June 24, 2025 :language: en :tags: announce-no-longer-support-llama-checkpoint .. _announce-no-longer-support-llama-32-meta-checkpoint: Announcing end of support for Llama 3.2 Meta checkpoint --------------------------------------------------------- Starting with :ref:`Neuron Release 2.24 `, the mllama 3.2 Meta checkpoint API is no longer supported. **I currently use the mllama 3.2 Meta checkpoint in my applications. What do I do?** All previously converted checkpoints will continue to function without disruption. Customers' existing workflows and converted models remain fully operational. For new checkpoint conversions, customers are advised to use the Hugging Face solution, which provides equivalent functionality. Hugging Face's official conversion script is available here: `HuggingFace Conversion Script `_ ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-nemo-megatron.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-no-support-nemo-megatron ..
_announce-no-support-nemo-megatron: Neuron no longer supports NeMo Megatron starting this release --------------------------------------------------------------- Starting with :ref:`Neuron release 2.23 `, Neuron no longer supports :ref:`NeMo Megatron `. All users of :ref:`nemo-megatron-index` are requested to migrate their training workloads to :ref:`NxD Training `. Please refer to the :ref:`Neuron NeMo Megatron to NeuronX Distributed Training Migration Guide ` for guidance. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-neurondevice.rst ================================================ .. post:: September 16, 2024 :language: en :tags: eos-neuron-device-version, neuron-device-version .. _eos-neuron-device-version: 'neuron-device-version' field in neuron-monitor no longer supported -------------------------------------------------------------------- Starting with :ref:`Neuron release 2.20 `, Neuron no longer supports the field 'neuron-device-version' in neuron-monitor. Customers who are using the field 'neuron-device-version' will instead need to use the 'instance_type' field in the 'instance_info' section and the 'neuroncore_version' field to obtain Neuron device information. Please see :ref:`neuron-monitor-ug` for more details. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-nki-jit-torch.rst ================================================ .. post:: June 24, 2025 :language: en :tags: announce-no-longer-support-nki-jit .. _announce-no-longer-support-nki-jit: Neuron no longer supports nki_jit API in PyTorch Neuron starting this release -------------------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.24 `, the ``torch_neuronx.nki_jit`` API in the ``torch-neuronx`` package is no longer supported. **I currently use nki_jit in my PyTorch models. What do I do?** Customers using the ``torch_neuronx.nki_jit`` API are recommended to switch invocations to directly call functions annotated with ``@nki.jit``. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-tensorboard-plugin.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-no-support-tensorboard .. _announce-no-support-tensorboard-plugin: Neuron no longer supports TensorBoard Plugin for Neuron Profiler starting with Neuron 2.28 ------------------------------------------------------------------------------------------- Starting with Neuron 2.28, Neuron no longer supports the TensorBoard plugin for Neuron Profiler. All customers using the TensorBoard plugin to visualize and analyze model performance are recommended to migrate to Neuron Explorer. To start using Neuron Explorer (available since Neuron 2.27) to profile your workloads, please see the :doc:`Neuron Explorer Getting Started guide `. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-tensorflow1-x.rst ================================================ .. post:: September 16, 2024 :language: en :tags: no-support-tensorflow-eos, tf-versions-1-x-no-support .. _announce-tfx-no-support: TensorFlow-Neuron 1.x no longer supported ------------------------------------------ Starting with :ref:`Neuron release 2.20 `, Neuron no longer supports TensorFlow-Neuron 1.x. Current users of those versions are advised to migrate to the latest tensorflow-neuron version, 2.10.1.
Please see :ref:`TensorFlow Neuron ` for more details. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-tensorflow2-10.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announce-no-support-tensorflow2-10 .. _announce-no-support-tensorflow2-10: Neuron no longer supports tensorflow_2_10 single framework DLAMI and virtual environment in multi-framework DLAMIs starting with Neuron 2.27 ---------------------------------------------------------------------------------------------------------------------------------------------- Starting with the release of Neuron 2.27.0, the ``tensorflow_2_10`` single framework Deep Learning AMI (DLAMI) and the TensorFlow 2.10 virtual environment in multi-framework DLAMIs are no longer supported. Users are advised to use previously released DLAMIs for TensorFlow 2.10 support, or migrate to newer supported TensorFlow versions. For more information on supported versions, refer to :doc:`the list of current Neuron-supported package and library versions `. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-tf-versions.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-no-support-tensorflow-eos .. _announce-no-support-tensorflow-eos: Neuron no longer supports TensorFlow 2.8 and 2.9 starting this release ----------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.23 `, Neuron no longer supports TensorFlow-Neuron versions 2.8 and 2.9. Current users of those versions are advised to migrate to the latest TensorFlow version (2.10). For a list of supported versions, please see :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-torch-neuron-versions.rst ================================================ .. post:: December 20, 2024 :language: en :tags: announce-no-support-torch-neuron .. _announce-no-support-torch-neuron: PyTorch Neuron versions 1.9 and 1.10 no longer supported ---------------------------------------------------------- Starting with :ref:`Neuron Release 2.21 `, Neuron no longer supports torch-neuron 1.9 and 1.10 versions. Current users of torch-neuron 1.9 and 1.10 versions are advised to migrate to the latest supported torch-neuron version. Please see :ref:`latest-neuron-release-artifacts`. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-ubuntu-20-base.rst ================================================ .. post:: May 15, 2025 :language: en :tags: announce-u20-base-no-support .. _announce-u20-base-no-support: Neuron no longer supports base Ubuntu 20 operating system starting this release -------------------------------------------------------------------------------- :ref:`Neuron Release 2.23 ` no longer includes support for the base Ubuntu 20.04 operating system. Customers using Ubuntu 20.04 are required to migrate their workloads to Ubuntu 22.04 or another supported operating system. Please refer to :ref:`neuron-dlami-overview` for guidance on Neuron supported operating systems. For more information on the Neuron operating system support policy, please see :ref:`sdk-maintenance-policy`.
================================================ FILE: about-neuron/announcements/neuron2.x/announce-no-support-vllm-v0.rst ================================================ .. post:: February 26, 2026 :language: en :tags: announce-no-support-vllm .. _announce-no-support-vllm-v0: Neuron no longer supports vLLM V0 starting with Neuron 2.28 ------------------------------------------------------------ Starting with the Neuron 2.28 release, vLLM V0 will no longer be supported. This includes the vLLM V0 Neuron forks in the AWS Neuron `upstreaming-to-vllm GitHub repo `__ and vLLM V0-based Neuron Inference Deep Learning Containers. Customers are recommended to use vLLM V1-based inference containers as documented in the :doc:`vLLM V1 user guide `. Additionally, Neuron will be updating existing vLLM-based tutorials to use vLLM V1 in the coming release. See :ref:`vLLM on Neuron ` for more information on vLLM V1 support. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-nxdi-changes.rst ================================================ .. post:: December 19, 2025 :language: en :tags: announce-nxdi-changes .. _announce-nxdi-changes: Announcing changes to NxDI in the upcoming releases ==================================================== As part of our transition to native PyTorch support, we are simplifying NxDI to provide a more streamlined developer experience. **What's changing:** In the upcoming releases, we will introduce NxDI v2, which will not use the NxDI ModelBuilder APIs. Instead, it will use ``torch.compile`` for model compilation. We will also simplify the NxDI modeling APIs to align with native PyTorch primitives. **Timeline and migration:** While we introduce these changes, we will maintain both NxDI v1 and NxDI v2 simultaneously to ensure a smooth migration path for our customers. We will provide detailed migration guidance, timelines, and updated documentation as we approach the transition. More information about the migration path and specific release dates will be shared in the next release (Neuron 2.28). ================================================ FILE: about-neuron/announcements/neuron2.x/announce-package-change.rst ================================================ .. post:: September 16, 2024 :language: en :tags: announce-nxdcore, neuron-component-nxdcore .. _announce-component-name-change-nxdcore: Announcing Name Change for Neuron Component --------------------------------------------- Starting with :ref:`Neuron release 2.20 `, the name of the following Neuron component will change as follows:

======================= ======================= ============================ ==================
Package name            Current Name            New Name                     Abbreviation
======================= ======================= ============================ ==================
neuronx-distributed     NeuronX Distributed     NeuronX Distributed Core     NxD Core
======================= ======================= ============================ ==================

================================================ FILE: about-neuron/announcements/neuron2.x/announce-python38-no-longer-support.rst ================================================ .. post:: April 3, 2025 :language: en :tags: announce-python-version-no-longer-support ..
_announce-python-no-longer-support: Neuron no longer includes Python 3.8 support starting this release ------------------------------------------------------------------- Starting with :ref:`Neuron Release 2.22 `, Neuron no longer includes support for Python 3.8, as it has reached its end-of-life status. ========================= How does this impact me? ========================= I currently use Python 3.8. ============================ To avoid security issues and bugs, current users of Python 3.8 are advised to migrate to a Neuron supported Python version (3.9, 3.10, or 3.11), as Neuron no longer supports Python 3.8. For a list of supported Python versions by Neuron package, please see :ref:`latest-neuron-release-artifacts`. I currently use Ubuntu 20, which has Python 3.8 as the default version. Am I affected? ======================================================================================= Although Python 3.8 is the default version of Ubuntu 20.04, Neuron will continue to support Ubuntu 20.04 until April 2025, due to extended standard support of Python 3.8 in Ubuntu 20. Please see the :ref:`sdk-maintenance-policy` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announce-transition-pytorch-trainium.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announce-transition-pytorch-trainium .. _announce-transition-pytorch-trainium: Announcing Transition to PyTorch Native Support for AWS Trainium in the Next Neuron Release Supporting PyTorch 2.10 ------------------------------------------------------------------------------------------------------------------------ Starting with the introduction of Neuron support for PyTorch 2.10, AWS Neuron will begin a transition from PyTorch/XLA to native PyTorch support via TorchNeuron. PyTorch 2.9 will be the last version based on PyTorch/XLA. What's changing ^^^^^^^^^^^^^^^^ * If you are using PyTorch 2.9, it is the last version that uses the PyTorch/XLA backend in Neuron. * For PyTorch 2.10 and later, Neuron will provide native PyTorch support via TorchNeuron. Customers using PyTorch/XLA-based training should migrate to native PyTorch with TorchNeuron, which provides: * Native PyTorch eager execution mode * Standard distributed primitives (DTensor, FSDP, DDP) * ``torch.compile`` support * Compatibility with frameworks like TorchTitan (PyTorch Training Library) For more information about native PyTorch on Neuron and migration guidance, see :doc:`Native PyTorch for AWS Trainium `. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-end-of-support-neuronxcc-nki.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-end-of-support-neuronxcc-nki .. _announcement-end-of-support-neuronxcc-nki: Announcing End of Support for neuronxcc.nki Namespace Starting with Neuron 2.28 -------------------------------------------------------------------------------- Neuron 2.27 will be the last release to include support for the ``neuronxcc.nki.*`` namespace. Starting with Neuron 2.28, this namespace will no longer be supported. The new ``nki.*`` namespace introduces changes to NKI APIs and language constructs. Existing kernels using ``neuronxcc.nki.*`` must migrate to the new ``nki.*`` namespace. A kernel migration guide is available in the Neuron 2.27 documentation. See :doc:`the NKI Kernel Migration Guide
` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-end-of-support-nxdt-nxd-core.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-end-of-support-nxdt-nxd-core .. _announcement-end-of-support-nxdt-nxd-core: Announcing End of Support for NxDT and NxD Core Training APIs Starting with PyTorch 2.10 ----------------------------------------------------------------------------------------- The Neuron release supporting PyTorch 2.9 will be the last to include the NeuronX Distributed Training (NxDT) libraries, NxD Core training APIs, and PyTorch/XLA for training. Starting with Neuron support for PyTorch 2.10, these components will no longer be supported. How does this impact you ^^^^^^^^^^^^^^^^^^^^^^^^^ Existing NxDT/NxD Core users should stay on PyTorch 2.9 until ready to migrate to native PyTorch on Neuron (starting PyTorch 2.10). Customers are recommended to use native PyTorch with standard distributed primitives (DTensor, FSDP, DDP) and TorchTitan starting with Neuron 2.28 and PyTorch 2.10. A migration guide will be published in a coming release. See :doc:`Native PyTorch on Neuron Overview ` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-end-of-support-parallel-model-trace.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-end-of-support-parallel-model-trace .. _announcement-end-of-support-parallel-model-trace: Neuron no longer supports parallel_model_trace API starting with Neuron 2.27 ----------------------------------------------------------------------------- Starting with the Neuron 2.27 release, the :ref:`parallel_model_trace API ` is no longer supported for inference. We introduced the :doc:`Model Builder V2 API ` in Neuron 2.25 as an alternative to the tracing API, and it is now the default API in Neuron for model tracing. Customers can migrate to the Model Builder V2 API by following the reference `Llama-3.2-1B inference sample `__. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-end-of-support-pytorch-2-6.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-end-of-support-pytorch-2-6 .. _announcement-end-of-support-pytorch-2-6: Neuron no longer supports PyTorch 2.6 starting with Neuron 2.27 --------------------------------------------------------------- Starting with Neuron 2.27, Neuron no longer supports PyTorch 2.6. We recommend that all customers using PyTorch 2.6 upgrade to the latest supported PyTorch version. Customers currently using PyTorch 2.6 must upgrade to a newer supported PyTorch version. For more information on supported versions, refer to :doc:`the list of current Neuron-supported package and library versions `. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-end-of-support-vllm-v0.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-end-of-support-vllm-v0 .. _announcement-end-of-support-vllm-v0: Announcing End of Support for vLLM V0 starting with Neuron 2.28 ---------------------------------------------------------------- Neuron Release 2.27 will be the last release to support vLLM V0.
In the Neuron 2.27 release, vLLM V1 support is introduced for Neuron using the ``vllm-neuron`` plugin. Review the sources in the `Neuron vLLM GitHub Repository `__. Starting with the Neuron 2.28 release, vLLM V0 will no longer be supported. Support will be dropped for the vLLM V0 Neuron forks of the `upstreaming-to-vllm `__ Neuron GitHub repo, along with vLLM V0-based Neuron Inference Deep Learning Containers. Customers should migrate to vLLM V1 using the :doc:`vLLM V1 user guide `. Customers are recommended to start using vLLM V1-based inference containers that are released with Neuron v2.27.0. We plan to update the existing vLLM-based tutorials to use vLLM V1 in the coming release. See :doc:`vLLM on Neuron ` for more information on vLLM V1. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-nki-library-kernel-migration.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-nki-library-kernel-migration .. _announcement-nki-library-kernel-migration: Announcing NKI Library Kernel Migration to New nki.* Namespace in Neuron 2.28 ------------------------------------------------------------------------------ Some NKI Library kernels currently use the legacy ``neuronxcc.nki.*`` namespace. Starting with Neuron 2.28, all NKI Library kernels will migrate to the new ``nki.*`` namespace. The new ``nki.*`` namespace introduces changes to NKI APIs and language constructs that improve usability and performance. This transition ensures consistency across all NKI kernels and allows us to focus development efforts on a single, modern namespace. See :doc:`the NKI Kernel Migration Guide ` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-nki-library-namespace-changes.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-nki-library-namespace-changes .. _announcement-nki-library-namespace-changes: Announcing NKI Library Namespace Changes in Neuron 2.28 -------------------------------------------------------- NKI Library kernels are published in the `NKI Library GitHub repository `__. In Neuron 2.27, these kernels are also shipped as part of neuronx-cc using the ``nkilib.*`` namespace. To avoid namespace conflicts when customers use kernels from the open source repository, the repository uses the ``nkilib_standalone.nkilib.*`` namespace. Starting with Neuron 2.28, the open source repository namespace will change from ``nkilib_standalone.nkilib.*`` to ``nkilib.*``, providing a consistent namespace between the open source repository and the shipped version. See :doc:`NKI Library ` for more information. ================================================ FILE: about-neuron/announcements/neuron2.x/announcement-python-3-9-eol.rst ================================================ .. post:: December 16, 2025 :language: en :tags: announcement-python-3-9-eol .. _announcement-python-3-9-eol: Neuron no longer supports Python 3.9 starting with Neuron version 2.27 ----------------------------------------------------------------------- Starting with Neuron Release 2.27, Neuron no longer includes support for Python 3.9, as it has reached its end-of-life status. If you currently use Python 3.9, you are advised to migrate to a Neuron supported Python version (3.10, 3.11, or 3.12) to avoid security issues and bugs.
For a list of supported Python versions by Neuron package, refer to :doc:`the list of current Neuron-supported package and library versions `.

================================================
FILE: about-neuron/announcements/neuron2.x/dlami-neuron-2.10.rst
================================================
.. post:: May 02, 2023 11:00
   :language: en
   :tags: dlami, pytorch, trn1, inf2, inf1

.. _announce-dlami-neuron-2.10:

AWS Deep Learning AMIs now available with Neuron 2.10 version
-------------------------------------------------------------

We are happy to announce that the following Deep Learning AMIs are now available with the latest Neuron version 2.10. These DLAMIs now support all the Neuron EC2 instances, including Inf1, Inf2, and Trn1/Trn1n.

You can access the AMIs at the following URLs:

* `AWS Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) `__
* `AWS Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) `__
* `AWS Deep Learning AMI Base Neuron (Ubuntu 20.04) `__
* `AWS Deep Learning AMI Base Neuron (Amazon Linux 2) `__

================================================
FILE: about-neuron/announcements/neuron2.x/dlami-neuron-2.12.rst
================================================
.. post:: July 26, 2023 11:00
   :language: en
   :tags: dlami, pytorch, trn1, inf2, inf1

.. _announce-dlami-neuron-2.12:

AWS Deep Learning AMIs now available with Neuron 2.12 version
-------------------------------------------------------------

We are happy to announce that the following Deep Learning AMIs are now available with the latest Neuron version 2.12.

You can learn more about the AMIs at the following URLs:

* `AWS Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) `__
* `AWS Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2) `__
* `AWS Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04) `__
* `AWS Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2) `__
* `AWS Deep Learning AMI Base Neuron (Ubuntu 20.04) `__
* `AWS Deep Learning AMI Base Neuron (Amazon Linux 2) `__

================================================
FILE: about-neuron/announcements/neuron2.x/dlami-pytorch-introduce.rst
================================================
.. post:: Nov 02, 2022 00:01
   :language: en
   :tags: dlami, pytorch

.. _announce-dlami-neuron-pytorch:

Introducing AWS Deep Learning AMI Neuron PyTorch
------------------------------------------------

We are happy to announce that a Deep Learning AMI (DLAMI) with pre-installed PyTorch Neuron (``torch-neuronx``) is now available. For more information, see:

* `AWS Deep Learning AMI Neuron PyTorch 1.11 \(Amazon Linux 2\) `_
* `AWS Deep Learning AMI Neuron PyTorch 1.11 \(Ubuntu 20.04\) `_

The Neuron Setup Guide will be updated soon to include the DLAMI PyTorch Neuron.

================================================
FILE: about-neuron/announcements/neuron2.x/end-of-support-pt2.rst
================================================
.. post:: February 2, 2024
   :language: en
   :tags: eos-pt-two, pt-two

.. _eos_pytorch2:

PyTorch NeuronX version 2.0 (Beta) no longer supported
-------------------------------------------------------

:ref:`Neuron release 2.17 ` no longer supports PyTorch NeuronX version 2.0 (Beta). Current users of PyTorch NeuronX version 2.0 are advised to migrate to PyTorch NeuronX 2.1 (Beta).

================================================
FILE: about-neuron/announcements/neuron2.x/github-changes.rst
================================================
.. post:: Oct 10, 2022 02:00
   :language: en
   :tags: github
.. _announce-aws-neuron-github-org:

Introducing New Neuron GitHub Repositories
------------------------------------------

Starting with Neuron release 2.3, Neuron GitHub repositories will be migrated to the new `AWS Neuron GitHub Organization `_. The new AWS Neuron GitHub Organization will include the `Neuron SDK GitHub `_ repository and the following additional new GitHub repositories:

.. list-table:: AWS Neuron GitHub Organization
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - New GitHub repository
     - Description
   * - `AWS Neuron Samples `_
     - Repository that hosts examples and scripts used in the Neuron documentation tutorials
   * - `AWS Neuron Reference for Megatron-LM `_
     - Repository that hosts Neuron support for Megatron-LM
   * - `AWS Neuron Samples for AWS ParallelCluster `_
     - Repository that hosts Neuron support for AWS ParallelCluster

================================================
FILE: about-neuron/announcements/neuron2.x/gpg-expiration.rst
================================================
.. post:: Nov 10, 2022 00:01
   :language: en
   :tags: dlami, pytorch

.. _announce-neuron-gpg-expiration:

Neuron GPG key for Ubuntu installation has expired
--------------------------------------------------

GPG, or GNU Privacy Guard, is a public key cryptography implementation. It allows for the secure transmission of information between parties and can be used to verify that the origin of a message is genuine.

The GPG key for the Neuron repository (https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB) is installed on the Ubuntu (Canonical) server. The key was originally uploaded with an expiry date of three (3) years and expired on 11/10/22.

Please see :ref:`gpg_key_update` for instructions on how to update the Neuron repository GPG keys.

================================================
FILE: about-neuron/announcements/neuron2.x/neuron-rtd-eol.rst
================================================
.. post:: Oct 10, 2022 01:00
   :language: en
   :tags: eol, neuron2.x

.. _announce-neuron-rtd-eol:

Announcing Neuron Runtime 1.x (``neuron-rtd``) end-of-support
-------------------------------------------------------------

Starting with Neuron release 2.3, Neuron components like Neuron System Tools and Neuron Driver will no longer support Neuron Runtime 1.x. In addition, starting with Neuron release 2.3, the `AWS Neuron Runtime Proto GitHub `_ and `AWS Neuron Driver GitHub `_ repositories will no longer be supported.

Why are we removing support for Neuron Runtime 1.x?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Neuron Runtime 1.x (``neuron-rtd``) entered :ref:`maintenance mode ` when Neuron 1.16.0 was released. While Neuron components like Neuron Driver and Neuron System Tools continued to support Neuron Runtime 1.x in addition to supporting Neuron Runtime 2.x, Neuron supported frameworks (e.g. PyTorch Neuron, TensorFlow Neuron, and MXNet Neuron) stopped supporting Neuron Runtime 1.x starting with Neuron 1.16.0. For detailed information see :ref:`introduce-libnrt`.

================================================
FILE: about-neuron/announcements/neuron2.x/neuron2-intro.rst
================================================
.. post:: Oct 10, 2022 04:00
   :language: en
   :tags: neuron2.x
.. _neuron2-intro:

Introducing the first release of Neuron 2.x enabling EC2 Trn1 General Availability (GA)
========================================================================================

Neuron release 2.3 is the first release of Neuron 2.x that enables GA of the new EC2 Trn1 instances. Neuron release 2.3 extends the latest release of Neuron 1.x (Neuron 1.19.2), adding support for Deep Learning training on the AWS Trainium chips.

Starting with Neuron release 2.3, developers can run Deep Learning training workloads on Trn1 instances, saving training costs by up to 50% over equivalent GPU-based EC2 instances, while achieving the highest training performance in the AWS cloud for popular NLP models. Neuron 2.x introduces new capabilities and major architectural updates to support training neural networks with the Trn1 instances.

In addition, starting with this release, Neuron introduces new packages, renames several packages, and updates Neuron installation and update instructions. This release also ends support for Neuron Runtime 1.x.

More about the release
----------------------

.. include:: /release-notes/templates/n2.x-trn1-ga-quick.txt

================================================
FILE: about-neuron/announcements/neuron2.x/neuron230-packages-changes.rst
================================================
.. post:: Oct 10, 2022 03:00
   :language: en
   :tags: neuron2.x

.. _neuron-packages-changes:

Introducing Packaging and installation changes
----------------------------------------------

Starting with Neuron release 2.3, Neuron introduces changes in Neuron packages and installation instructions.

.. contents:: Table of contents
   :local:
   :depth: 2

.. _neuron-new-packages:

New Neuron packages
^^^^^^^^^^^^^^^^^^^

Starting with Neuron release 2.3, Neuron introduces the following new packages:

.. list-table:: New Neuron packages
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - New Package
     - Package Type
     - Description
     - Supported Instances (At the time of releasing Neuron release 2.3)
   * - ``torch-neuronx``
     - .whl (pip)
     - PyTorch Neuron package using `PyTorch XLA `_
     - Trn1
   * - ``neuronx-cc``
     - .whl (pip)
     - Neuron Compiler with XLA front-end
     - Trn1
   * - ``aws-neuronx-runtime-lib``
     - .deb (apt), .rpm (yum)
     - Neuron Runtime library
     - Trn1
   * - ``aws-neuronx-collective``
     - .deb (apt), .rpm (yum)
     - Collective Communication library
     - Trn1
   * - ``aws-neuronx-tools``
     - .deb (apt), .rpm (yum)
     - Neuron System Tools
     - Trn1

.. note:: In upcoming releases, ``aws-neuronx-tools`` and ``aws-neuronx-runtime-lib`` will add support for Inf1.

Why are we introducing new Neuron packages?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add Neuron support for training neural networks, Neuron 2.x introduces new capabilities and major architectural updates. For example, Neuron adds support for Collective Communication Operations in :ref:`new packages ` such as ``aws-neuronx-collective``.

In addition, some of those updates and new capabilities are not backward compatible; for example, the PyTorch Neuron package that adds support for training neural networks uses `PyTorch XLA `_ as a backend. To reduce the possibility of customers using features that are not backward compatible, the new capabilities are introduced in new Neuron packages. For example, PyTorch Neuron and Neuron Compiler will use different packages for Inf1 and for Trn1: ``torch-neuron`` and ``neuron-cc`` will support Inf1 instances, and ``torch-neuronx`` and ``neuronx-cc`` will support Trn1 instances.
.. _neuron-packages-renaming:

Renamed Neuron Packages
^^^^^^^^^^^^^^^^^^^^^^^

Starting with Neuron release 2.3, the following Neuron packages will change names:

.. list-table:: Neuron packages with changed names
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - New name
     - Old name (deprecated package)
     - Package Type
     - Description
     - Supported Instances
   * - ``aws-neuronx-oci-hooks``
     - ``aws-neuron-runtime-base``
     - .deb (apt), .rpm (yum)
     - OCI Hooks support
     - Trn1, Inf1
   * - ``aws-neuronx-dkms``
     - ``aws-neuron-dkms``
     - .deb (apt), .rpm (yum)
     - Neuron Driver
     - Trn1, Inf1
   * - ``aws-neuronx-k8-plugin``
     - ``aws-neuron-k8-plugin``
     - .deb (apt), .rpm (yum)
     - Neuron Kubernetes plugin
     - Trn1, Inf1
   * - ``aws-neuronx-k8-scheduler``
     - ``aws-neuron-k8-scheduler``
     - .deb (apt), .rpm (yum)
     - Neuron Scheduler plugin
     - Trn1, Inf1

Why are we changing package names?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To avoid situations where customers may accidentally install Neuron packages with features that are not backward compatible, we have introduced additional packages with different names for the same Neuron component.

.. _neuron-installation-instruction-change:

Updated installation and update instructions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with Neuron release 2.3, Neuron installation and update instructions will include pinning of the major version of the Neuron package. For example, to install the latest Neuron tools package, call ``sudo apt-get install aws-neuronx-tools=2.*``, and to install the latest PyTorch Neuron package for Trn1, call ``pip install torch-neuronx==1.11.0.1.*``.

Why are we changing installation and update instructions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Neuron installation and update instructions now guide customers to pin the major version of the different Neuron packages, as mentioned in :ref:`neuron-installation-instruction-change`. This is done to future-proof instructions for new, backwards-incompatible major version releases.

.. note:: The change of the installation and update instructions will not include instructions to install or update ``torch-neuron`` and ``neuron-cc``.

What do I need to do?
~~~~~~~~~~~~~~~~~~~~~

Please follow the :ref:`Neuron setup guide ` to update to the latest Neuron release.

================================================
FILE: about-neuron/announcements/neuron2.x/neuron250-packages-changes.rst
================================================
.. post:: Nov 22, 2022 03:00
   :language: en
   :tags: neuron2.x

.. _neuron250-packages-changes:

Introducing Neuron packaging and installation changes for Inf1 customers
------------------------------------------------------------------------

Starting with :ref:`Neuron release 2.5 `, Neuron introduces changes in Neuron packages and installation instructions for Inf1. The following Neuron packages will change names:
.. list-table:: Neuron packages with changed names for Inf1
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - New name
     - Old name (deprecated package)
     - Package Type
     - Description
     - Supported Instances
   * - ``aws-neuronx-tools``
     - ``aws-neuron-tools``
     - .deb (apt), .rpm (yum)
     - System Tools
     - Trn1, Inf1
   * - ``aws-neuronx-dkms``
     - ``aws-neuron-dkms``
     - .deb (apt), .rpm (yum)
     - Neuron Driver
     - Trn1, Inf1
   * - ``aws-neuronx-k8-plugin``
     - ``aws-neuron-k8-plugin``
     - .deb (apt), .rpm (yum)
     - Neuron Kubernetes plugin
     - Trn1, Inf1
   * - ``aws-neuronx-k8-scheduler``
     - ``aws-neuron-k8-scheduler``
     - .deb (apt), .rpm (yum)
     - Neuron Scheduler plugin
     - Trn1, Inf1
   * - ``tensorflow-model-server-neuronx``
     - ``tensorflow-model-server-neuron``
     - .deb (apt), .rpm (yum)
     - tensorflow-model-server
     - Trn1, Inf1

Please follow the :ref:`Neuron setup guide ` to update to the latest Neuron release.

================================================
FILE: about-neuron/announcements/neuron2.x/release-neuron2.4.rst
================================================

================================================
FILE: about-neuron/announcements/neuron2.x/sm-training-dlc-2.9.1.rst
================================================
.. post:: Apr 26, 2023 11:00
   :language: en
   :tags: sagemaker, pytorch, trn1, inf2

.. _announce-dlc-sm-neuron-2.9.1:

PyTorch 1.13 Deep Learning Container for Inf2 & Trn1/Trn1n now available for SageMaker
--------------------------------------------------------------------------------------

We are happy to announce that an updated Deep Learning Container that supports PyTorch 1.13 and Neuron 2.9.1 versions is now available for SageMaker Training. For more information see `Neuron Containers `_.

================================================
FILE: about-neuron/announcements/neuron2.x/sm-training-trn1-introduce.rst
================================================
.. post:: Nov 03, 2022 00:01
   :language: en
   :tags: sagemaker, pytorch, trn1

.. _announce-sm-trn1-training:

Amazon SageMaker now supports Trn1 training jobs
------------------------------------------------

We are happy to announce that Amazon SageMaker now supports running training jobs on ml.trn1 instance types. For more information see `Distributed Training with PyTorch Neuron on Trn1 instances `_.

The Neuron Developer Flows section will be updated soon.

================================================
FILE: about-neuron/appnotes/index.rst
================================================
.. _neuron-appnotes-index:
.. _neuron-appnotes:

.. meta::
   :description: AWS Neuron SDK application notes for support announcements, performance optimization, migration guides, and framework-specific implementations.
   :date-modified: 2025-10-03

Neuron application notes
========================

.. toctree::
   :maxdepth: 2
   :hidden:

   Neuron Runtime Library Performance
   Parallel execution
   PyTorch for Neuron
   PyTorch for NeuronX

Application notes provide specific documentation for support announcements, migration guides, performance optimization techniques, and framework-specific implementations for AWS Neuron SDK components.

Framework integration
---------------------

.. grid:: 1 1 2 2
   :gutter: 2

   .. grid-item-card::
      :link: torch-neuron-r-cnn-app-note
      :link-type: ref

      **PyTorch Neuron (Inf1)**
      ^^^
      R-CNN implementation and optimization techniques for PyTorch on ``Inf1``
   .. grid-item-card::
      :link: torch-neuronx-graph-partitioner-app-note
      :link-type: ref

      **PyTorch NeuronX Graph Partitioner**
      ^^^
      Advanced graph partitioning strategies for distributed training and inference

   .. grid-item-card::
      :link: torch-neuronx-dataparallel-app-note
      :link-type: ref

      **Data Parallel Inference on Torch NeuronX**
      ^^^
      Guide to using ``torch_neuronx.DataParallel`` for scalable inference

   .. grid-item-card::
      :link: torch-neuron-dataparallel-app-note
      :link-type: ref

      **Data Parallel Inference on Torch Neuron**
      ^^^
      Guide to using ``torch.neuron.DataParallel`` for scalable inference on ``Inf1``

   .. grid-item-card::
      :link: migration_from_xla_downcast_bf16
      :link-type: ref

      **Migrate from XLA_USE_BF16/XLA_DOWNCAST_BF16**
      ^^^
      Guide to migrating from deprecated XLA environment variables to recommended PyTorch mixed-precision options on NeuronX

   .. grid-item-card::
      :link: introduce-pytorch-2-9
      :link-type: ref

      **PyTorch 2.9 Support**
      ^^^
      New features and migration guide for PyTorch 2.9 on Neuron

================================================
FILE: about-neuron/appnotes/mxnet-neuron/flex-eg.rst
================================================
.. _flexeg:

Flexible Execution Group (FlexEG) in Neuron-MXNet
=================================================

Introduction
------------

Inf1 instances are available with different numbers of Inferentia chips. Each Inferentia chip comprises four NeuronCores, and an Inf1 instance includes 4 to 64 NeuronCores depending on the instance size. With Neuron Runtime 1.x (the ``neuron-rtd`` server), NeuronCores could be combined into NeuronCore Groups (NCGs), which were the basic scheduling units of compiled neural networks in Neuron. NCGs of the desired sizes were created at the start of the application and could not be modified afterwards.

Starting with Neuron SDK 1.16.0, and with the introduction of Neuron Runtime 2.x, MXNet Neuron 1.8 introduces the Flexible Execution Groups (FlexEG) feature. With FlexEG, you do not have to create NCGs at the start of the process. Instead, you set the index of the first NeuronCore you want to load models onto, and FlexEG enables the flexibility of loading models onto any available NeuronCores on the Inf1 instance, starting from the first NeuronCore you set. This guide will show you how to efficiently utilize NeuronCores using the FlexEG feature in Neuron MXNet.

FlexEG
------

With the introduction of FlexEG, you don't need to create NCGs and can load models onto a group of consecutive NeuronCores by providing the index of the first NeuronCore in the group. The Neuron runtime takes care of figuring out the number of NeuronCores required for the given compiled model and loads the model using the required number of cores (sequentially, starting with the NeuronCore index provided by the user).

For example, assuming that you have an inf1.6xlarge machine and there are 4 models A, B, C, D compiled to 2, 4, 3, and 4 NeuronCores respectively, you can map any model to any core with the context ``mx.neuron(neuron_core_index)``, where ``neuron_core_index`` is the NeuronCore index (0, 1, 2, 3, 4, ...). In the example below, you map model A to the ``mx.neuron(0)`` context, model B to the ``mx.neuron(2)`` context, model C to the ``mx.neuron(6)`` context, and model D to the ``mx.neuron(9)`` context.

.. figure:: /images/mx_FlexEG_arch_1.png
   :scale: 80 %

The above configuration is achieved by using application code similar to below:
.. code:: python

   # Load models (MXNet)
   # loaded onto the 2 cores starting with core 0
   sym, args, aux = mx.model.load_checkpoint(mx_model0_file, 0)
   model0 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')

   # loaded onto the 4 cores starting with core 2
   sym, args, aux = mx.model.load_checkpoint(mx_model1_file, 0)
   model1 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')

   # loaded onto the 3 cores starting with core 6
   sym, args, aux = mx.model.load_checkpoint(mx_model2_file, 0)
   model2 = sym.bind(ctx=mx.neuron(6), args=args, aux_states=aux, grad_req='null')

   # loaded onto the 4 cores starting with core 9
   sym, args, aux = mx.model.load_checkpoint(mx_model3_file, 0)
   model3 = sym.bind(ctx=mx.neuron(9), args=args, aux_states=aux, grad_req='null')

   # run inference by simply calling the loaded model
   results0 = model0.forward(data=inputs0)
   results1 = model1.forward(data=inputs1)
   results2 = model2.forward(data=inputs2)
   results3 = model3.forward(data=inputs3)

Since there is no NCG creation at the start of the process, you can load the same four models in a different configuration by changing the context used for inference. For example, you could map model C to the ``mx.neuron(0)`` context, model A to the ``mx.neuron(3)`` context, model D to the ``mx.neuron(5)`` context, and model B to the ``mx.neuron(9)`` context.

.. figure:: /images/mx_FlexEG_arch_2.png
   :scale: 80 %

Migration from NeuronCore Groups to FlexEG
------------------------------------------

NeuronCore Groups are defined by setting the environment variable ``NEURONCORE_GROUP_SIZES`` with a comma-separated list of the number of cores in each group. In this mode of operation, the NeuronCores (as defined in ``NEURONCORE_GROUP_SIZES``) are grouped together to create single entities. The ``NEURONCORE_GROUP_SIZES`` environment variable is set at runtime:

.. code:: bash

   #!/bin/bash
   export NEURONCORE_GROUP_SIZES=2,4,3,4
   python your_neuron_application.py

NeuronCore Groups are created once at the start of the application and cannot be modified or re-created while the application process runs. The above flow creates 4 NeuronCore Groups with 2, 4, 3, and 4 NeuronCores each. In order to get the same configuration as the example from before, you map model A to the ``mx.neuron(0)`` context, model B to the ``mx.neuron(1)`` context, model C to the ``mx.neuron(2)`` context, and model D to the ``mx.neuron(3)`` context.

.. figure:: /images/mx_FlexEG_arch_1.png
   :scale: 80 %

This can be achieved programmatically as shown below:
.. code:: python

   # Set Environment
   os.environ['NEURONCORE_GROUP_SIZES'] = '2,4,3,4'

   # Load models (MXNet)
   # loaded into the first group of NC0-NC1
   sym, args, aux = mx.model.load_checkpoint(mx_model0_file, 0)
   model0 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')

   # loaded into the second group of NC2-NC5
   sym, args, aux = mx.model.load_checkpoint(mx_model1_file, 0)
   model1 = sym.bind(ctx=mx.neuron(1), args=args, aux_states=aux, grad_req='null')

   # loaded into the third group of NC6-NC8
   sym, args, aux = mx.model.load_checkpoint(mx_model2_file, 0)
   model2 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')

   # loaded into the fourth group of NC9-NC12
   sym, args, aux = mx.model.load_checkpoint(mx_model3_file, 0)
   model3 = sym.bind(ctx=mx.neuron(3), args=args, aux_states=aux, grad_req='null')

   # run inference by simply calling the loaded model
   results0 = model0.forward(data=inputs0)
   results1 = model1.forward(data=inputs1)
   results2 = model2.forward(data=inputs2)
   results3 = model3.forward(data=inputs3)

Comparing the two approaches, we see that in the case of NCGs the neuron context takes the index of the execution group, while with FlexEG the neuron context takes the NeuronCore index of the first NeuronCore on which the model is to be loaded and executed. For example, with ``NEURONCORE_GROUP_SIZES='2,4,3,4'``, ``ctx=mx.neuron(1)`` loads the model on execution group 1, which effectively loads the model on the 2nd NCG, which has 4 NeuronCores.

Best practices when using FlexEG
--------------------------------

FlexEG gives the user the most flexibility in terms of accessing cores and loading models on specific cores. With this, users can effortlessly load and execute new models on NeuronCores without closing the application. Here we outline some best practices that should be kept in mind while using FlexEG.

Choosing starting core
~~~~~~~~~~~~~~~~~~~~~~

FlexEG tries to use the required number of cores (based on the input model) starting with the core index provided by the user. If the system does not have the required number of cores starting from the given core index, the model load will fail. For example: we have a model X which needs 2 cores and an inf1.xlarge machine with 4 NeuronCores (NeuronCore indexes 0, 1, 2, and 3). As the model needs at least 2 consecutive cores, the valid start indexes for this model are 0, 1, and 2. If the user instead gives 3 as the neuron context index, there are not 2 cores available starting from core 3, so the model load will fail.
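The arithmetic behind this check is simple enough to express in a few lines. The helper below is illustrative only (``valid_start_indices`` is a hypothetical name, not part of any Neuron API); it assumes you know the total NeuronCore count of your instance and how many cores the compiled model needs:

.. code:: python

   def valid_start_indices(total_cores, model_cores):
       """Return the NeuronCore indexes a model can start at.

       A model compiled to `model_cores` cores occupies that many
       consecutive cores, so the last valid start index is
       total_cores - model_cores.
       """
       return list(range(total_cores - model_cores + 1))

   # inf1.xlarge has 4 NeuronCores; a model compiled to 2 cores
   # can start at index 0, 1, or 2 -- starting at 3 would fail.
   print(valid_start_indices(4, 2))  # [0, 1, 2]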
Performance vs. Flexibility tradeoff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using the data parallel mode of operation (where models are executed in parallel), for optimal performance you should make sure that the models do not share any cores. That is because a NeuronCore can execute one model at a time: when two or more models are executed on the same core (assuming that they are already loaded), it executes the first model, stops it, starts the second model, and then executes it. This is called model switching; it involves additional overhead and prevents the models from executing in parallel.

For example, assuming that you have an inf1.6xlarge machine and there are 4 models A, B, C, D compiled to 2, 4, 3, and 4 NeuronCores respectively: loading model A to the ``mx.neuron(0)`` context, model B to the ``mx.neuron(2)`` context, model C to the ``mx.neuron(6)`` context, and model D to the ``mx.neuron(9)`` context is a good configuration, because no two models share NeuronCores and all can execute in parallel.

However, loading model A to the ``mx.neuron(0)`` context, model B to the ``mx.neuron(2)`` context, model C to the ``mx.neuron(5)`` context, and model D to the ``mx.neuron(9)`` context is not a good configuration, as models B and C share NeuronCore 5 and thus cannot execute in parallel.

.. figure:: /images/mx_FlexEG_arch_bad.png
   :scale: 80 %

================================================
FILE: about-neuron/appnotes/neuron-cc/mixed-precision.rst
================================================
.. _neuron-cc-training-mixed-precision:

Mixed precision and performance-accuracy tuning (``neuron-cc``)
===============================================================

.. contents:: Table of contents
   :local:
   :depth: 2

The Neuron Compiler supports machine learning models with FP32, FP16 and BF16 (Bfloat16) tensors and operators. The Neuron hardware supports a mix of 32 and 16 bit datatypes. The available auto-cast methods and their performance / accuracy trade-offs are explained in this document.

Neuron Hardware
---------------

The Neuron hardware supports matrix multiplication using FP16 or BF16 on its Matmult Engine, and accumulations using FP32. Similarly, operators such as activations or vector operations are supported using FP16, BF16 and FP32. Neuron supports tensor transpose in two ways: by fast matrix multiplication in FP16/BF16, or by slower byte-by-byte data movements.

Performance-accuracy tradeoffs for models trained in FP32
---------------------------------------------------------

Models that are trained using FP32 data types can be deployed on Neuron through ahead-of-time compilation using the :ref:`Neuron Compiler `.

.. important:: **By default, the Neuron Compiler casts FP32 weights and operations to BF16**. Only partial sums are left in FP32. The default casting will generate the highest performance for an FP32 trained model, but not necessarily the best accuracy.

Using the ``--fast-math`` CLI option, you can choose the right tradeoff between performance and accuracy. The tradeoff usually is between achieving high performance or optimal accuracy, and the decision on which settings to use will be application specific. It is recommended that you start by compiling the model for high performance (the default); you can then test the accuracy of the application and, if needed, try the next higher-precision casting option until the desired accuracy and performance are achieved. A typical flow can be:

1. Compile without options (default) or with ``--fast-math all``, which will optimize for performance.
2. If accuracy is not sufficient, try ``--fast-math fp32-cast-matmult``.
3. If accuracy is not sufficient, try ``--fast-math fp32-cast-matmult no-fast-relayout``.
4. If accuracy is not sufficient, try ``--fast-math none``, which will optimize for accuracy.

Between step 2 and step 3, and between step 3 and step 4, you have additional options that can provide different levels of accuracy; these are explained in the section below.

Note that the compiler has to preserve the input/output (i/o) tensor types requested by the Framework; therefore no casting is done on the i/o tensors. Additional speedup can be obtained by casting them in the Framework prior to compilation.
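For example, when tracing with PyTorch Neuron, these options can be forwarded at compile time through the ``compiler_args`` parameter of ``torch.neuron.trace``. The snippet below is a minimal sketch under that assumption; the model and flag combination shown are illustrative, not a prescribed configuration:

.. code:: python

   import torch
   import torch.neuron
   import torchvision.models as models

   # Example model; any traceable FP32 PyTorch model works the same way.
   model = models.resnet50(pretrained=True).eval()
   example = torch.rand(1, 3, 224, 224)

   # Step 2 of the flow above: restrict FP32->BF16 casting to matrix
   # multiplications only, trading some performance back for accuracy.
   model_neuron = torch.neuron.trace(
       model,
       example,
       compiler_args=['--fast-math', 'fp32-cast-matmult'],
   )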
To learn how to use compiler command line interface (CLI) options with your application's framework, please see :ref:`torch_neuron_trace_api`, :ref:`tensorflow-ref-neuron-compile-api` and :ref:`tensorflow-ref-neuron-tracing-api`.

Compiler casting options
------------------------

``--fast-math`` option
^^^^^^^^^^^^^^^^^^^^^^

The ``--fast-math`` option is intended to replace the ``--fp32-cast`` option. We recommend that you start using, or migrate to, the ``--fast-math`` option. The ``--fast-math`` option provides the same level of functionality as the ``--fp32-cast`` option, in addition to the following:

* The ``--fast-math`` option introduces the ``no-fast-relayout`` option to enable lossless transpose operations. This was not possible with the ``--fp32-cast`` option.
* The ``--fast-math`` option provides finer control than the ``--fp32-cast`` option. The transpose operation and the cast operation are controlled independently:

  - ``no-fast-relayout`` and ``fast-relayout`` provide control for the transpose operation.
  - ``fp32-cast-*`` provides control for casting.

See the detailed list of the options in :ref:`/compiler/neuron-cc/command-line-reference.rst`.

================================================
FILE: about-neuron/appnotes/neuron1x/important-neuronx-dkms.txt
================================================
.. important::

   Starting with Neuron version 2.3, the ``aws-neuron-dkms`` package name has been changed to ``aws-neuronx-dkms``. See :ref:`neuron2-intro`.

================================================
FILE: about-neuron/appnotes/neuron1x/introducing-libnrt.rst
================================================
.. _introduce-libnrt:

Introducing Neuron Runtime 2.x (libnrt.so)
==========================================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we changing?
---------------------

Starting with the *Neuron 1.16.0* release, *Neuron Runtime 1.x* (``neuron-rtd``) is entering maintenance mode and is being replaced by *Neuron Runtime 2.x*, a shared library named ``libnrt.so``. For more information on Runtime 1.x see :ref:`maintenance_rtd`.

Upgrading to ``libnrt.so`` simplifies the Neuron installation and upgrade process, introduces new capabilities for allocating NeuronCores to applications, streamlines container creation, and deprecates tools that are no longer needed. This document describes the capabilities of *Neuron Runtime 2.x* in detail, provides information needed for successful installation and upgrade, and provides information needed for a successful upgrade of Neuron applications from *Neuron Runtime 1.x* (included in releases before *Neuron 1.16.0*) to *Neuron Runtime 2.x* (included in releases *Neuron 1.16.0* or newer).

.. _introduce-libnrt-why:

Why are we making this change?
------------------------------

Before *Neuron 1.16.0*, Neuron Runtime was delivered as a daemon (``neuron-rtd``) and communicated with Neuron framework extensions through a ``gRPC`` interface. ``neuron-rtd`` was packaged as an ``rpm`` or ``debian`` package (``aws-neuron-runtime``) and required a separate installation step.

Starting with *Neuron 1.16.0*, *Neuron Runtime 2.x* is delivered as a shared library (``libnrt.so``) and is directly linked to Neuron framework extensions. ``libnrt.so`` is packaged and installed as part of the Neuron framework extensions (e.g. TensorFlow Neuron, PyTorch Neuron or MXNet Neuron), and does not require a separate installation step.
Installing Neuron Runtime as part of the Neuron framework extensions simplifies installation and improves the user experience. In addition, since ``libnrt.so`` is directly linked to the Neuron framework extensions, faster communication between the Neuron Runtime and Neuron Frameworks is enabled by eliminating the ``gRPC`` interface overhead.

For more information see :ref:`introduce-libnrt-how-sdk` and :ref:`neuron-migrating-apps-neuron-to-libnrt`.

.. _libnrt-neuron-cmponents:
.. _introduce-libnrt-how-sdk:

How will this change affect the Neuron SDK?
-------------------------------------------

Neuron Driver
^^^^^^^^^^^^^

Use the latest Neuron Driver. For successful installation of, and upgrade to, *Neuron 1.16.0* or newer, you must install or upgrade to Neuron Driver (``aws-neuron-dkms``) *version 2.1.5.0* or newer. Neuron applications using *Neuron 1.16.0* will fail if they do not detect *Neuron Driver version 2.1.5.0* or newer. For installation and upgrade instructions see :ref:`install-guide-index`.

.. include:: ./important-neuronx-dkms.txt

To see details of Neuron component versions please see :ref:`latest-neuron-release-artifacts`.

.. important::

   For successful installation or update to Neuron 1.16.0 and newer from previous releases:

   * Stop the Neuron Runtime 1.x daemon (``neuron-rtd``) by running: ``sudo systemctl stop neuron-rtd``
   * Uninstall ``neuron-rtd`` by running: ``sudo apt remove aws-neuron-runtime`` or ``sudo dnf remove aws-neuron-runtime``
   * Install or upgrade to the latest Neuron Driver (``aws-neuron-dkms``) by following the :ref:`install-guide-index` instructions.
   * Starting with Neuron version 2.3, the ``aws-neuron-dkms`` package name has been changed to ``aws-neuronx-dkms``; see :ref:`neuron2-intro`.

Neuron Runtime
^^^^^^^^^^^^^^

* Installation

  Starting from *Neuron 1.16.0*, Neuron releases will no longer include the ``aws-neuron-runtime`` packages, and Neuron Runtime will be part of the Neuron framework extension of choice (TensorFlow Neuron, PyTorch Neuron or MXNet Neuron). Installing any Neuron framework package will install the Neuron Runtime library (``libnrt.so``).

  * For installation and upgrade instructions see :ref:`install-guide-index`.

* Configuring *Neuron Runtime*

  Before *Neuron 1.16.0*, *Neuron Runtime 1.x* was configured in configuration files (e.g. /opt/aws/neuron/config/neuron-rtd.config). Starting from *Neuron 1.16.0*, *Neuron Runtime 2.x* can be configured through environment variables. See :ref:`nrt-configuration` for details.

* Starting and Stopping *Neuron Runtime*

  Before introducing ``libnrt.so``, ``neuron-rtd`` ran as a daemon that communicated through a ``gRPC`` interface. Whenever ``neuron-rtd`` took ownership of a Neuron device, it continued owning that device until it was stopped. This created the need to stop ``neuron-rtd`` in certain cases. With the introduction of ``libnrt.so``, *Neuron Runtime* runs inside the context of the application. With *Neuron Runtime 2.x*, the act of starting and stopping a Neuron application causes ``libnrt.so`` to automatically claim or release ownership of the required Neuron devices.

* NeuronCore Groups (NCG) end-of-support

  Before the introduction of *Neuron Runtime 2.x*, a NeuronCore Group (NCG) was used to define an execution group of one or more NeuronCores where models could be loaded and executed. It also provided separation between processes. With the introduction of *Neuron Runtime 2.x*, strict separation of NeuronCores into groups is no longer necessary, and NeuronCore Groups (NCG) has been deprecated.
  See :ref:`eol-ncg` for more information.

* Running multiple *Neuron Runtimes*

  Before the introduction of ``libnrt.so``, it was necessary to run multiple ``neuron-rtd`` daemons, configured through configuration files, to allocate Neuron devices to each ``neuron-rtd``. After the introduction of ``libnrt.so``, it is no longer necessary to run multiple daemons to allocate Neuron devices to a specific Neuron application. With ``libnrt.so``, NeuronCores (a Neuron device includes multiple NeuronCores) are allocated to a particular application by using the ``NEURON_RT_VISIBLE_CORES`` or ``NEURON_RT_NUM_CORES`` environment variables, for example:

  .. code::

     NEURON_RT_VISIBLE_CORES=0-3 myapp1.py
     NEURON_RT_VISIBLE_CORES=4-11 myapp2.py

  Or:

  .. code::

     NEURON_RT_NUM_CORES=3 myapp1.py &
     NEURON_RT_NUM_CORES=4 myapp2.py &

  See :ref:`nrt-configuration` for details.

* Logging

  Similar to *Neuron Runtime 1.x*, *Neuron Runtime 2.x* logs to syslog (verbose logging). To make debugging easier, *Neuron Runtime 2.x* also logs to the console (error-only logging). Refer to :ref:`nrt-configuration` to see how to increase or decrease logging verbosity.

* Multi-process access to NeuronCores

  With the introduction of ``libnrt.so``, it is no longer possible to load models from multiple processes on the same NeuronCore. A NeuronCore can only be accessed from a single process. Instead, you can load models on a specific NeuronCore using multiple threads from the same process.

  .. note:: For optimal performance of multi-model execution, each NeuronCore executes a single model.

* Neuron Runtime architecture

  *Neuron Runtime 2.x* is delivered as a shared library (``libnrt.so``) and is directly linked to Neuron framework extensions. ``libnrt.so`` is packaged and installed as part of Neuron framework extensions (e.g. TensorFlow Neuron, PyTorch Neuron, or MXNet Neuron), and does not require a separate installation step. Installing Neuron Runtime as part of the Neuron framework extensions simplifies installation and improves the user experience. In addition, since ``libnrt.so`` is directly linked to Neuron framework extensions, it enables faster communication between Neuron Runtime and Neuron Frameworks by eliminating ``gRPC`` interface overhead.

Neuron framework extensions
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting from *Neuron 1.16.0*, Neuron framework extensions (TensorFlow Neuron, PyTorch Neuron, or MXNet Neuron) are packaged together with ``libnrt.so``. The ``aws-neuron-dkms`` Driver version 2.1.5.0 or newer is required for proper operation. The ``neuron-rtd`` daemon that was installed in previous releases no longer works starting with Neuron 1.16.0. To see details of Neuron component versions see :ref:`latest-neuron-release-artifacts`.

.. important::

   Starting with Neuron version 2.3, the ``aws-neuron-dkms`` package name is changed to ``aws-neuronx-dkms``; see :ref:`neuron2-intro`.

TensorFlow model server
^^^^^^^^^^^^^^^^^^^^^^^

Starting from *Neuron 1.16.0*, the TensorFlow Neuron model server is packaged together with ``libnrt.so`` and expects ``aws-neuron-dkms`` *version 2.1.5.0* or newer for proper operation.

.. note:: The TensorFlow Neuron model server included in *Neuron 1.16.0* runs from the directory in which it was installed and will not run properly if copied to a different location, due to its dependency on ``libnrt.so``.

.. include:: ./important-neuronx-dkms.txt

Neuron tools
^^^^^^^^^^^^

* ``neuron-cli`` - Starting from *Neuron 1.16.0*, ``neuron-cli`` enters maintenance mode.
  See :ref:`maintenance_neuron-cli` for more information.

* ``neuron-top`` - Starting from *Neuron 1.16.0*, ``neuron-top`` has a new user interface. See :ref:`neuron-top-ug` for more information.
* ``neuron-monitor`` - ``neuron-monitor`` was updated to support Neuron Runtime 2.x (``libnrt.so``).

  * See :ref:`neuron-monitor-ug` for an updated user guide of ``neuron-monitor``.
  * See the neuron-monitor upgrade notes for a list of changes between *Neuron Monitor 2.x* and *Neuron Monitor 1.0*.
  * See the neuron-monitor backward compatibility notes for instructions on using *Neuron Monitor 2.x* with *Neuron Runtime 1.x* (``neuron-rtd``).

.. _introduce-libnrt-how-user:

How will this change affect me?
-------------------------------

Neuron installation and upgrade
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As explained in ":ref:`libnrt-neuron-cmponents`", starting from *Neuron 1.16.0*, ``libnrt.so`` requires the latest Neuron Driver (``aws-neuron-dkms``). In addition, it is no longer necessary to install ``aws-neuron-runtime``.

To install Neuron or to upgrade to the latest Neuron version, follow the installation and upgrade instructions below:

* PyTorch Neuron

  * :ref:`install-neuron-pytorch`.
  * :ref:`update-neuron-pytorch`.

* TensorFlow Neuron

  * :ref:`install-neuron-tensorflow`.
  * :ref:`update-neuron-tensorflow`.

* MXNet Neuron

  * :ref:`install-neuron-mxnet`.
  * :ref:`update-neuron-mxnet`.

.. include:: ./important-neuronx-dkms.txt

.. _neuron-migrating-apps-neuron-to-libnrt:

Migrate your application to Neuron Runtime 2.x (libnrt.so)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a successful migration of your application from previous releases to *Neuron 1.16.0* or newer, make sure you perform the following:

#. Prerequisite

   Read ":ref:`libnrt-neuron-cmponents`".

#. Make sure you are not using *Neuron Runtime 1.x* (``aws-neuron-runtime``)

   * Remove any code that installs ``aws-neuron-runtime`` from any CI/CD scripts.
   * Stop ``neuron-rtd`` by running ``sudo systemctl stop neuron-rtd``.
   * Uninstall ``neuron-rtd`` by running ``sudo apt remove aws-neuron-runtime`` or ``sudo dnf remove aws-neuron-runtime``.

#. Upgrade to your Neuron Framework of choice:

   * :ref:`update-neuron-pytorch`.
   * :ref:`update-neuron-tensorflow`.
   * :ref:`update-neuron-mxnet`.

#. If you have code that starts and/or stops ``neuron-rtd``

   Remove any code that starts or stops ``neuron-rtd`` from any CI/CD scripts.

#. Application running multiple ``neuron-rtd``

   If your application runs multiple processes and requires running multiple ``neuron-rtd`` daemons:

   * Remove the code that runs multiple ``neuron-rtd`` daemons.
   * Instead of allocating Neuron devices to ``neuron-rtd`` through configuration files, use the ``NEURON_RT_VISIBLE_CORES`` or ``NEURON_RT_NUM_CORES`` environment variables to allocate NeuronCores. See :ref:`nrt-configuration` for details. If your application uses ``NEURONCORE_GROUP_SIZES``, see the next item.

   .. note:: The ``NEURON_RT_VISIBLE_CORES`` and ``NEURON_RT_NUM_CORES`` environment variables enable you to allocate NeuronCores to an application. Allocating NeuronCores improves application granularity, because Neuron devices include multiple NeuronCores.

#. Application running multiple processes using ``NEURONCORE_GROUP_SIZES``

   * Consider using the ``NEURON_RT_VISIBLE_CORES`` or ``NEURON_RT_NUM_CORES`` environment variables instead of ``NEURONCORE_GROUP_SIZES``, which is being deprecated. See :ref:`nrt-configuration` for details.
   * If you are using TensorFlow Neuron (``tensorflow-neuron (TF2.x)``) and you are replacing ``NEURONCORE_GROUP_SIZES=AxB``, which enables auto multicore replication, see the new API :ref:`tensorflow-ref-auto-replication-python-api` for usage and documentation.
   * The behavior of your application will remain the same as before if you do not set ``NEURON_RT_VISIBLE_CORES`` and do not set ``NEURON_RT_NUM_CORES``.
   * If you are considering migrating to ``NEURON_RT_VISIBLE_CORES`` or ``NEURON_RT_NUM_CORES``:

     * ``NEURON_RT_VISIBLE_CORES`` takes precedence over ``NEURON_RT_NUM_CORES``.
     * If you are migrating to ``NEURON_RT_VISIBLE_CORES``:

       * For TensorFlow applications or PyTorch applications, make sure that ``NEURONCORE_GROUP_SIZES`` is unset, or that ``NEURONCORE_GROUP_SIZES`` allocates the same or a smaller number of NeuronCores as allocated by ``NEURON_RT_VISIBLE_CORES``.
       * For MXNet applications, setting the ``NEURONCORE_GROUP_SIZES`` and ``NEURON_RT_VISIBLE_CORES`` environment variables at the same time is not supported. Use ``NEURON_RT_VISIBLE_CORES`` only.
       * See :ref:`nrt-configuration` for more details on how to use ``NEURON_RT_VISIBLE_CORES``.

     * If you are migrating to ``NEURON_RT_NUM_CORES``:

       * Make sure that ``NEURONCORE_GROUP_SIZES`` is unset.
       * See :ref:`nrt-configuration` for more details on how to use ``NEURON_RT_NUM_CORES``.

#. Application running multiple processes accessing the same NeuronCore

   If your application accesses the same NeuronCore from multiple processes, this is no longer possible with ``libnrt.so``. Instead, modify your application to access the same NeuronCore from multiple threads.

   .. note:: Optimal performance of multi-model execution is achieved when each NeuronCore executes a single model.

#. Neuron Tools

   * If you are using Neuron Monitor, see the neuron-monitor upgrade notes for details.
   * If you are using ``neuron-cli``, remove any call to ``neuron-cli``. For more information, see :ref:`maintenance_neuron-cli`.

#. Containers

   If your application is running within a container, and it previously executed ``neuron-rtd`` within the container, you need to re-build your container so that it does not include or install ``aws-neuron-runtime``. See :ref:`neuron-containers` for details.

Troubleshooting
---------------

Application fails to start
^^^^^^^^^^^^^^^^^^^^^^^^^^

Description
~~~~~~~~~~~

Starting with the *Neuron 1.16.0* release, Neuron Runtime (``libnrt.so``) requires *Neuron Driver 2.0* or greater (``aws-neuron-dkms``). Neuron Runtime requires the Neuron Driver (``aws-neuron-dkms`` package) to access Neuron devices. If ``aws-neuron-dkms`` is not installed, the application will fail with an error message on the console and in syslog similar to the following:

.. code::

   NRT:nrt_init Unable to determine Neuron Driver version. Please check aws-neuron-dkms package is installed.

If an old ``aws-neuron-dkms`` is installed, the application will fail with an error message on the console and in syslog similar to the following:

.. code::

   NRT:nrt_init This runtime requires Neuron Driver version 2.0 or greater. Please upgrade aws-neuron-dkms package.

Solution
~~~~~~~~

Follow the installation steps in :ref:`install-guide-index` to install ``aws-neuron-dkms``.
.. include:: ./important-neuronx-dkms.txt

Application fails to start although I installed the latest ``aws-neuron-dkms``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Description
~~~~~~~~~~~

Starting from the *Neuron 1.16.0* release, Neuron Runtime (``libnrt.so``) requires *Neuron Driver 2.0* or greater (``aws-neuron-dkms``). If an old ``aws-neuron-dkms`` is installed, the application will fail. You may try to install ``aws-neuron-dkms`` and still face application failure, because the ``aws-neuron-dkms`` installation failed as a result of a ``neuron-rtd`` daemon that was still running.

Solution
~~~~~~~~

* Stop ``neuron-rtd`` by running: ``sudo systemctl stop neuron-rtd``
* Uninstall ``neuron-rtd`` by running: ``sudo apt remove aws-neuron-runtime`` or ``sudo dnf remove aws-neuron-runtime``
* Install ``aws-neuron-dkms`` by following the steps in :ref:`install-guide-index`

.. include:: ./important-neuronx-dkms.txt

Unexpected application behavior when upgrading to release *Neuron 1.16.0* or newer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Description
~~~~~~~~~~~

When upgrading to release *Neuron 1.16.0* or newer from previous releases, the OS may include two different versions of *Neuron Runtime*: the ``libnrt.so`` shared library and the ``neuron-rtd`` daemon. This can happen if the user did not stop the ``neuron-rtd`` daemon or did not make sure to uninstall the existing Neuron version before the upgrade. In this case the user application may behave unexpectedly.

Solution
~~~~~~~~

If the OS includes two different versions of *Neuron Runtime*, the ``libnrt.so`` shared library and the ``neuron-rtd`` daemon:

* Before running applications that use ``neuron-rtd``, restart ``neuron-rtd`` by calling ``sudo systemctl restart neuron-rtd``.
* Before running applications linked with ``libnrt.so``, stop ``neuron-rtd`` by calling ``sudo systemctl stop neuron-rtd``.

Unexpected application behavior when downgrading to releases before *Neuron 1.16.0* (from *Neuron 1.16.0* or newer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Description
~~~~~~~~~~~

When upgrading to release *Neuron 1.16.0* or newer from previous releases, and then downgrading back to releases before *Neuron 1.16.0*, the OS may include two different versions of *Neuron Runtime*: the ``libnrt.so`` shared library and the ``neuron-rtd`` daemon. This can happen if the user did not make sure to uninstall the existing Neuron version before the upgrade or downgrade. In this case the user application may behave unexpectedly.

Solution
~~~~~~~~

If the OS includes two different versions of *Neuron Runtime*, the ``libnrt.so`` shared library and the ``neuron-rtd`` daemon:

* Before running applications that use ``neuron-rtd``, restart ``neuron-rtd`` by calling ``sudo systemctl restart neuron-rtd``.
* Before running applications linked with ``libnrt.so``, stop ``neuron-rtd`` by calling ``sudo systemctl stop neuron-rtd``.

NeuronCore is in use
^^^^^^^^^^^^^^^^^^^^

Description
~~~~~~~~~~~

A NeuronCore cannot be shared between two applications. If an application has started using a NeuronCore, all other applications trying to use the same NeuronCore will fail during runtime initialization with the following message in the console and in syslog:

.. code:: bash

   ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:nc1-nc1 Available:0

Solution
~~~~~~~~

Terminate the process using the NeuronCore and then try launching the application.
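One way to avoid this contention up front is to give each process its own disjoint set of cores using the ``NEURON_RT_VISIBLE_CORES`` variable described earlier. The snippet below is a minimal sketch: the variable must be set before the framework extension initializes the runtime, and ``model_neuron.pt`` is a hypothetical compiled-model file name used for illustration:

.. code:: python

   import os

   # Must be set before the Neuron framework extension loads the runtime,
   # so this process claims only NeuronCores 0 and 1. A second process
   # could use, for example, '2-3' to avoid any overlap.
   os.environ['NEURON_RT_VISIBLE_CORES'] = '0-1'

   import torch
   import torch.neuron

   model = torch.jit.load('model_neuron.pt')  # hypothetical compiled model
   output = model(torch.rand(1, 3, 224, 224))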
Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my model to run it with Neuron Runtime 2.x (``libnrt.so``)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

No.

Do I need to change my application launch command?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

No.

Can ``libnrt.so`` and ``neuron-rtd`` co-exist in the same environment?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although we recommend upgrading to the latest Neuron release, we understand that for a transition period you may continue using ``neuron-rtd`` with old releases. If you are using a Neuron Framework (PyTorch, TensorFlow or MXNet) from releases before *Neuron 1.16.0*:

* Install the latest Neuron Driver (``aws-neuron-dkms``).

.. include:: ./important-neuronx-dkms.txt

* For development, we recommend using different environments for Neuron Frameworks (PyTorch, TensorFlow or MXNet) from releases before *Neuron 1.16.0* and for those from *Neuron 1.16.0* and newer. If that is not possible, make sure to stop ``neuron-rtd`` before executing models using a Neuron Framework from *Neuron 1.16.0* and newer.
* For deployment, when you are ready to upgrade, upgrade to a Neuron Framework (PyTorch, TensorFlow or MXNet) from *Neuron 1.16.0* or newer. See :ref:`neuron-migrating-apps-neuron-to-libnrt` for more information.

.. warning::

   Executing models using a Neuron Framework (PyTorch, TensorFlow or MXNet) from *Neuron 1.16.0* and newer in an environment where ``neuron-rtd`` is running may cause undefined behavior. Make sure to stop ``neuron-rtd`` before executing models using a Neuron Framework from *Neuron 1.16.0* and newer.

Are there Neuron framework versions that will not support Neuron Runtime 2.x (``libnrt.so``)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All supported PyTorch Neuron and TensorFlow Neuron framework extensions, as well as the MXNet Neuron 1.8.0 framework extension, support Neuron Runtime 2.x. MXNet Neuron 1.5.1 does not support Neuron Runtime 2.x (``libnrt.so``) and has now entered maintenance mode. See :ref:`maintenance_mxnet_1_5` for details.

================================================
FILE: about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.rst
================================================
.. _neuronx-cc-training-mixed-precision:

Mixed Precision and Performance-accuracy Tuning (``neuronx-cc``)
================================================================

.. contents:: Table of contents
   :local:
   :depth: 2

Overview
--------

The Neuron Compiler supports machine learning models with FP32, TF32, FP16 and BF16 (Bfloat16) tensors and operators. The Neuron hardware supports a mix of 32, 16, and 8 bit datatypes. This guide explains how to apply the available auto-cast methods and their performance / accuracy trade-offs when compiling a model with Neuron.

.. note:: Neuron Compiler support for INT8 is planned for a future Neuron SDK release. See `Neuron Compiler: Enable Neuron INT8 support `_ for details.

Neuron Hardware
---------------

The Neuron v2 hardware supports matrix multiplication using FP16, BF16, TF32, and FP32 on its matrix multiply ("matmult") engine, and accumulations using FP32. Operators such as activations or vector operations are supported using FP32, TF32, FP16, and BF16.
Supporting FP16 and BF16 allows Neuron to achieve significantly higher performance than executing everything in FP32.

Performance-accuracy tradeoffs
------------------------------

**By default**, the Neuron Compiler will **automatically cast FP32 matrix multiplication operations to BF16**. The remaining operations are performed in the data type specified by the model. The Neuron Compiler provides CLI options that direct the compiler to cast to other data types, giving you the ability to choose an accuracy-to-performance tradeoff in model execution. Deciding which CLI settings to use will be application specific and may require some experimentation. See the :ref:`Neuron Compiler CLI Reference Guide` for details.

What is the difference between Data Types?
-------------------------------------------

The NeuronCore v2 supports multiple data types (see :ref:`NeuronCore v2 Data Types`). Each data type provides benefits and drawbacks due to its dynamic range and numeric precision.

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Type
     - Minimum
     - Maximum
     - Strength
     - Weakness
   * - FP16
     - -65504
     - 65504
     - Numeric Precision, High granularity, Mid-range numbers
     - Low range, medium precision
   * - BF16
     - -3.40E+38
     - 3.40E+38
     - Dynamic Range, Extremely small/large numbers
     - Low precision
   * - TF32
     - -3.40E+38
     - 3.40E+38
     - Dynamic Range, Extremely small/large numbers
     - Medium precision
   * - FP32
     - -3.40E+38
     - 3.40E+38
     - N/A
     - Larger model size, potentially slower computation

* FP16 provides a high density of representable values that are neither extremely small nor extremely large. The density of representable values within the range is approximately an order of magnitude greater than BF16.

  * Conversion from FP32 to FP16 will perform well when values are relatively small but non-extreme (neither very small nor very large).
  * Conversion from FP32 to FP16 will perform badly if the original FP32 values are outside of the range of FP16. This will produce inf/-inf values and may result in NaN depending on the operation.

* BF16 provides a wider range of representable values, which includes both very small and very large values. However, the overall density of representable values is lower than FP16 for non-extreme values. The range is nearly identical to the range of FP32, but because the number of bits is halved, the individual values are sparse.

  * Conversion from FP32 to BF16 will perform well when the values are well-distributed throughout the range. Since BF16 covers the entire FP32 range, each original value can map to a relatively close downcast value.
  * Conversion from FP32 to BF16 will perform badly when fine granularity is needed. Since BF16 granularity is sacrificed for greater range, it will almost always map worse than FP16 to values that are within the FP16 range.
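The range-versus-granularity tradeoff is easy to observe directly in PyTorch; the short sketch below (not Neuron-specific, just standard tensor dtype conversion) shows FP16 overflowing where BF16 merely rounds:

.. code:: python

   import torch

   x = torch.tensor([1e-4, 1.0, 70000.0], dtype=torch.float32)

   # FP16 cannot represent values beyond +/-65504, so 70000.0 overflows
   # to inf, while small and mid-range values keep fine granularity.
   print(x.to(torch.float16))

   # BF16 keeps (nearly) the FP32 dynamic range, so nothing overflows,
   # but mid-range values round more coarsely (70000.0 lands near 70144).
   print(x.to(torch.bfloat16))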
Should I downcast operations to smaller Data Types?
---------------------------------------------------

This choice is driven entirely by the accuracy-versus-performance tradeoff. Casting operations to smaller 16-bit data types provides a significant performance benefit but may sacrifice accuracy. The compiler uses BF16 casting **by default** for matrix multiplication operations. The speedup from casting gives a significant performance boost, and the range of representable values in BF16 allows for more safety compared to FP16 when the possible numeric range of input values is unknown.

The Neuron Compiler's ``--auto-cast`` and ``--auto-cast-type`` CLI options are used to direct the compiler to perform alternate casting operations. See the detailed list of the options in :ref:`Neuron v2 Compiler CLI Reference Guide`. If the ``--auto-cast`` flag is not provided, the compiler applies its default of casting FP32 matrix multiplication operations to BF16 (equivalent to ``--auto-cast matmult --auto-cast-type bf16``); use ``--auto-cast none`` to disable all auto-casting.

The option combinations to consider in a typical flow are:

+-----------------------------------------------+---------------------------------------------------------------------------+----------------------------------------------------+-------------------------------------------------+
| Compiler autocast                             | Options Effect                                                            | Performance                                        | Accuracy                                        |
+===============================================+===========================================================================+====================================================+=================================================+
| ``--auto-cast none``                          | Disables all auto-casting, using the data types defined within the model | Lowest performance                                 | Highest accuracy                                |
+-----------------------------------------------+---------------------------------------------------------------------------+----------------------------------------------------+-------------------------------------------------+
| ``--auto-cast matmult --auto-cast-type tf32`` |                                                                           | Performance *increases* as you move down the table | Accuracy *decreases* as you move down the table |
+-----------------------------------------------+---------------------------------------------------------------------------+                                                    |                                                 |
| ``--auto-cast all --auto-cast-type tf32``     | Balance of performance, dynamic range, and precision                      |                                                    |                                                 |
+-----------------------------------------------+---------------------------------------------------------------------------+                                                    |                                                 |
| ``--auto-cast matmult --auto-cast-type fp16`` |                                                                           |                                                    |                                                 |
+-----------------------------------------------+---------------------------------------------------------------------------+                                                    |                                                 |
| ``--auto-cast all --auto-cast-type fp16``     | Best performance at the expense of dynamic range                          |                                                    |                                                 |
+-----------------------------------------------+---------------------------------------------------------------------------+                                                    |                                                 |
| ``--auto-cast matmult --auto-cast-type bf16`` | Best performance at the expense of precision                              |                                                    |                                                 |
+-----------------------------------------------+                                                                           +----------------------------------------------------+-------------------------------------------------+
| ``--auto-cast all --auto-cast-type bf16``     |                                                                           | Highest performance                                | Lowest accuracy                                 |
+-----------------------------------------------+---------------------------------------------------------------------------+----------------------------------------------------+-------------------------------------------------+
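For concreteness, a hedged sketch of selecting the casting behavior at compile time (the input file name, target, and output path are assumptions; the ``--auto-cast`` options are those described above):

.. code:: bash

   # Sketch: compile an XLA HLO graph, casting FP32 matmult operations to BF16
   neuronx-cc compile model.hlo \
       --framework XLA \
       --target trn1 \
       --auto-cast matmult \
       --auto-cast-type bf16 \
       --output model.neff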
Note that the compiler must preserve the input/output (i/o) tensor types requested by the framework, so no casting is done on the i/o tensors. Additional speedup can be obtained by casting them in the framework prior to compilation.

To learn how to configure the compiler options from within your application's framework, please see:

* :ref:`Developer Guide for Training with PyTorch Neuron `

================================================
FILE: about-neuron/appnotes/neuronx-distributed/introducing-nxd-inference.rst
================================================

.. _introduce-nxd-inference:

Introducing NeuronX Distributed (NxD) Inference
=================================================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the Neuron SDK 2.21 release, we are introducing NxD Inference, an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. NxD Inference is designed for optimized inference, enabling quick onboarding of PyTorch models with minimal changes. It features a modular architecture that facilitates easy integration of HuggingFace PyTorch models and is compatible with serving engines like vLLM. See :ref:`nxdi-index` for an NxD Inference overview and documentation.

How can I install the NxD Inference library?
---------------------------------------------

Refer to :ref:`nxdi-setup` for installation instructions.

I am currently using the Transformers NeuronX library for inference. How does the NxD Inference library affect me?
--------------------------------------------------------------------------------------------------------------------

If you are using Transformers NeuronX (TNx) in production, you can continue doing so. However, if you are planning to onboard new models to Neuron for inference, NxD Inference offers several advantages to consider. NxD Inference is designed to enable easy onboarding of PyTorch models and comes with new features and enhanced support:

* **Hardware Support**: While TNx is not supported on Trn2, NxD Inference supports all platforms (Trn1, Inf2, and Trn2).
* **Simplified Interface**: With NxD Inference, you write modeling code using PyTorch with standard Python, rather than using PyHLO as in TNx.
* **Easy Migration**: NxD Inference was designed to provide seamless migration from TNx, especially if you are using it with vLLM. You can migrate your existing TNx inference scripts using the :ref:`migration guide `.
* **Enhanced Capabilities**: NxD Inference offers more comprehensive support for MoE models and multimodal models (such as Llama 3.2) compared to TNx.
* **Future Development**: New inference features and support for advanced model architectures (such as multi-modality/video models) will focus on NxD Inference.

I am currently using vLLM with the Transformers NeuronX library for inference. Does the NxD Inference library support vLLM?
----------------------------------------------------------------------------------------------------------------------------

Yes, the NxD Inference library supports the vLLM inference engine. Starting with the 2.21 release, the Neuron vLLM integration supports both the NxD Inference and Transformers NeuronX libraries. To use vLLM with the NxD Inference library, refer to the :ref:`nxdi-vllm-user-guide-v1`.
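For orientation only, a minimal hedged sketch of serving a model through vLLM's offline API; the model name, ``device`` argument, and parallelism degree are assumptions, and the user guide above documents the invocation supported by your Neuron SDK and vLLM versions:

.. code:: python

   from vllm import LLM, SamplingParams

   # Hypothetical invocation; consult the Neuron vLLM user guide for the
   # exact arguments supported by your SDK and vLLM versions.
   llm = LLM(model="meta-llama/Meta-Llama-3-8B", device="neuron", tensor_parallel_size=2)
   outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
   print(outputs[0].outputs[0].text)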
What features and models are available in Transformers NeuronX (TNx) but not yet in NeuronX Distributed Inference?
-------------------------------------------------------------------------------------------------------------------

While NxD Inference supports most features and models available in TNx, there are some differences in current support that users should be aware of.

**Features not yet supported in NxD Inference**: The following TNx features aren't supported yet in the NxD Inference library:

* Multi-Node Inference support

**Models not part of the NxD Inference Model Hub**: The following models are included in Transformers NeuronX but not currently in the NxD Inference library:

* Bloom
* GPT2
* GPT-J
* GPT-NeoX

If you need to use these models with NxD Inference, we encourage you to follow the :ref:`onboarding models developer guide `. The onboarding process in NxD Inference is more straightforward than in TNx due to its PyTorch-based architecture.

I currently use the Hugging Face TGI serving engine for deploying and serving Large Language Models (LLMs) on Neuron. How does the NxD Inference library affect me?
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

If you are currently using the Hugging Face TGI serving engine to deploy models on Neuron, the introduction of the NxD Inference library has no impact, and you can continue to run your existing inference workloads. Hugging Face TGI integrates with the Neuron SDK inference libraries in a way that abstracts the underlying library from the user.

I am new to Neuron and have inference workloads; what library should I use?
----------------------------------------------------------------------------

We recommend NxD Inference for your model inference workloads. To learn how to get started with NxD Inference, see the :ref:`nxdi-index` documentation.

Additional Resources
--------------------

* :ref:`nxdi-index`
* :ref:`nxdi-overview`
* :ref:`nxd-inference_rn`

================================================
FILE: about-neuron/appnotes/neuronx-distributed/introducing-nxdt-training.rst
================================================

.. _introduce-nxd-training:

Introducing NxD Training
===================================================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the Neuron 2.20 release, we are introducing NxD Training. In doing so, we are expanding the NeuronX Distributed library (previously called NxD, now called NxD Core) into NxD Training, adding data science/engineering modules and end-to-end examples. NxD Training is a PyTorch-based distributed training library that enables customers to train large-scale models. Key distributed strategies supported by NxD Training include 3D parallelism (data parallelism, tensor parallelism, and pipeline parallelism) and ZeRO-1 (where optimizer states are partitioned across workers). NxD Training supports model training workflows such as pretraining, supervised finetuning (SFT), and parameter-efficient finetuning (PEFT) using Low-Rank Adaptation (LoRA) techniques [#f1]_. For developers, NxD Training offers both API-level access through NxD Core and PyTorch Lightning, and an intuitive interface via YAML-based configuration files.
NxD Training takes a flexible approach: customers can adopt only the functionality that fits their workflow, and they can integrate their own machine learning training software at the appropriate level within NxD Training.

This is a beta preview version of NxD Training, and feedback from the developer community is strongly encouraged for upcoming releases.

.. _how-nxd-core-user-affected:

I currently use NeuronX Distributed (NxD Core). How does the NxD Training release affect me?
---------------------------------------------------------------------------------------------------------------

Existing NxD Core customers can continue to use the NxD Core APIs available under NxD Training. If workflows based on NxD Core meet your needs, you do not need to do anything differently with NxD Training's introduction. NxD Core APIs and functionality continue to be available to you as before. You can choose to :ref:`install NxD Core only ` and skip all subsequent installation steps for NxD Training. However, NxD Training adds support for YAML-based configuration, a model hub, and integration with PyTorch Lightning. If these capabilities are of interest to you, you may choose to evaluate and start using NxD Training.

.. _should_nnm_usage_continue:

Should current Neuron NeMo Megatron (NNM) users continue to use NNM?
------------------------------------------------------------------------------------------------

NxD Training offers the same capabilities as Neuron NeMo Megatron (NNM), and NNM will go into maintenance mode in the next release. If you are currently using NNM, the introduction of the NxD Training toolkit means that you should start evaluating NxD Training for your training needs. With its YAML interface, NxD Training is very close in usability to NNM and NeMo. Migrating from NNM to NxD Training should involve relatively minor effort; instructions for doing so are provided :ref:`here `.

.. _what_to_use_as_new_user:

I am new to Neuron and have training workloads; what toolkits or libraries should I use?
----------------------------------------------------------------------------------------

If you are starting with Neuron and looking for solutions to your model pretraining or finetuning needs, NxD Training is the recommended toolkit for you. Start from the :ref:`NxD Training page ` for overview, installation, and usage instructions.

Additional Resources
------------------------

NxD Training resources on getting started, usage, and support are listed below. If you encounter issues or have product-related questions, refer to the FAQs and troubleshooting guides. Additionally, feel free to reach out to us using the resources in the Support section.

* :ref:`How to get started `
* :ref:`Release notes `
* :ref:`Main section `
* :ref:`Troubleshooting `
* :ref:`Support `

.. [#f1] Supported through NxD Core.

================================================
FILE: about-neuron/appnotes/perf/neuron-cc/parallel-ncgs.rst
================================================

.. _parallel-exec-ncgs:

Parallel Execution using NEURON_RT_NUM_CORES
===============================================

.. important ::

   ``NEURONCORE_GROUP_SIZES`` will no longer be supported starting with the Neuron 1.19.0 release. If your application uses ``NEURONCORE_GROUP_SIZES``, see :ref:`neuron-migrating-apps-neuron-to-libnrt` and :ref:`eol-ncgs-env_2` for more details; a migration sketch follows below.
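As a quick orientation, an illustrative hedged sketch of the migration (group sizes and the script name are hypothetical): an application that previously requested groups of 2, 4, 3, and 4 NeuronCores would now request their total.

.. code :: bash

   # Before (deprecated):
   NEURONCORE_GROUP_SIZES=2,4,3,4 python your_neuron_application.py

   # After (Neuron 1.16.0 and newer): request the total number of cores (2+4+3+4)
   NEURON_RT_NUM_CORES=13 python your_neuron_application.py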
Introduction
------------

Inf1 instances are available with different numbers of Inferentia chips. Each Inferentia chip consists of 4 NeuronCores, and an Inf1 instance includes 4 to 64 NeuronCores, depending on the size of the instance. This guide shows you how to load one or more compiled models into different consecutive groups of NeuronCores using your framework of choice.

Data Parallel Execution
-----------------------

In PyTorch and TensorFlow, the same compiled model can run in parallel on an Inf1 instance by loading it multiple times, up to the total number of NeuronCores specified in NEURON_RT_NUM_CORES or NEURON_RT_VISIBLE_CORES. For more information about NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES, refer to :ref:`Neuron Runtime Configuration `.

Running multiple models using single process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run multiple models using a single process, set the environment variable ``NEURON_RT_NUM_CORES`` to the total number of NeuronCores needed by all of the model groups. You can set the ``NEURON_RT_NUM_CORES`` environment variable at runtime:

.. code :: bash

   #!/bin/bash
   NEURON_RT_NUM_CORES=13 python your_neuron_application.py

Or from within the Python process running your models (NOTE: you can only set it once in the same process, at the beginning of the script):

.. code :: python

   #!/usr/bin/env python
   import os

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '13'

   # Load models and run inferences ...

The following examples allow you to load 4 models into 4 groups of NeuronCores within one process. For example, if there are 4 models A, B, C, D compiled to 2, 4, 3, and 4 NeuronCores respectively, directly load the models A, B, C, D in sequence within your TensorFlow or PyTorch Neuron process. This example requires an inf1.6xlarge instance with 16 NeuronCores, as the total number of NeuronCores within the NeuronCore groups is 13.

In MXNet, the mapping from models to NeuronCores is controlled by the context ``mx.neuron(neuron_core_index)``, where ``neuron_core_index`` is the NeuronCore index at the start of the group. In the example above, map model A to the ``mx.neuron(0)`` context, model B to ``mx.neuron(2)``, model C to ``mx.neuron(6)``, and model D to ``mx.neuron(9)``. For further details, refer to :ref:`Flexible Execution Group (FlexEG) in Neuron-MXNet`.

For PyTorch, see :ref:`Data Parallel Inference on Torch Neuron` for more details.

For TensorFlow:

.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '13'

   # Load models (TF2)
   model0 = tf.keras.models.load_model(model0_file)  # loaded into the first group of NC0-NC1
   model1 = tf.keras.models.load_model(model1_file)  # loaded into the second group of NC2-NC5
   model2 = tf.keras.models.load_model(model2_file)  # loaded into the third group of NC6-NC8
   model3 = tf.keras.models.load_model(model3_file)  # loaded into the fourth group of NC9-NC12

   # run inference by simply calling the loaded model
   results0 = model0(inputs0)
   results1 = model1(inputs1)
   results2 = model2(inputs2)
   results3 = model3(inputs3)
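For comparison, a minimal hedged PyTorch sketch of the same pattern (file names and inputs are hypothetical; see the Data Parallel Inference guide linked above for the full treatment). Each ``torch.jit.load`` call places a compiled model on the next unused group of NeuronCores of the size it was compiled for:

.. code :: python

   import os
   # Set Environment (must happen before loading models)
   os.environ['NEURON_RT_NUM_CORES'] = '13'

   import torch
   import torch_neuron

   # Hypothetical models compiled for 2, 4, 3, and 4 NeuronCores respectively
   model0 = torch.jit.load('model0_neuron.pt')  # first group, NC0-NC1
   model1 = torch.jit.load('model1_neuron.pt')  # second group, NC2-NC5
   model2 = torch.jit.load('model2_neuron.pt')  # third group, NC6-NC8
   model3 = torch.jit.load('model3_neuron.pt')  # fourth group, NC9-NC12

   results0 = model0(inputs0)  # run inference by calling the loaded model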
For MXNet 2.x:

.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '13'

   # Load models (MXNet)
   # loaded into the first group of NC0-NC1
   sym, args, aux = mx.model.load_checkpoint(mx_model0_file, 0)
   model0 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')
   # loaded into the second group of NC2-NC5
   sym, args, aux = mx.model.load_checkpoint(mx_model1_file, 0)
   model1 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')
   # loaded into the third group of NC6-NC8
   sym, args, aux = mx.model.load_checkpoint(mx_model2_file, 0)
   model2 = sym.bind(ctx=mx.neuron(6), args=args, aux_states=aux, grad_req='null')
   # loaded into the fourth group of NC9-NC12
   sym, args, aux = mx.model.load_checkpoint(mx_model3_file, 0)
   model3 = sym.bind(ctx=mx.neuron(9), args=args, aux_states=aux, grad_req='null')

   # run inference by simply calling the loaded model
   results0 = model0.forward(data=inputs0)
   results1 = model1.forward(data=inputs1)
   results2 = model2.forward(data=inputs2)
   results3 = model3.forward(data=inputs3)

You can identify the NeuronCores used by each application with the ``neuron-top`` command line tool. For more information about the neuron-top user interface, see :ref:`Neuron Top User Guide `.

.. code :: bash

   $ neuron-top

.. figure:: /images/multi_1core_models_multi_processes.png
   :scale: 80 %

Running multiple models using multiple processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also run multiple models in parallel processes, when you set ``NEURON_RT_NUM_CORES`` per process:

.. code :: bash

   $ NEURON_RT_NUM_CORES=2 python your_1st_neuron_application.py
   $ NEURON_RT_NUM_CORES=2 python your_2nd_neuron_application.py

The first process automatically selects a first set of 2 unused NeuronCores for its new group. The second process automatically selects a new set of 2 unused NeuronCores for its new group.

.. figure:: /images/multi_2cores_models_multi_processes.png
   :scale: 80 %

Running multiple models on the same NeuronCore group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can load more than one model in a NeuronCore group within one process. The Neuron runtime handles switching from one model to the next within the NeuronCore group, when the next model is run within the application. In TensorFlow or PyTorch, simply load the additional models after the initial number of models have been loaded, to fill the NeuronCore groups associated with the process.

For PyTorch:

.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '2'

   # Load models (PT)
   model0 = torch.jit.load(model0_file)  # loaded into the first group of NC0-NC1
   model1 = torch.jit.load(model1_file)  # loaded into the first group of NC0-NC1

   # run inference by simply calling the loaded model
   results0 = model0(inputs0)
   results1 = model1(inputs1)

For TensorFlow 2.x:

.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '2'

   # Load models (TF2)
   model0 = tf.keras.models.load_model(model0_file)  # loaded into the first group of NC0-NC1
   model1 = tf.keras.models.load_model(model1_file)  # loaded into the first group of NC0-NC1

   # run inference by simply calling the loaded model
   results0 = model0(inputs0)
   results1 = model1(inputs1)

In MXNet, use the context ``mx.neuron(neuron_core_index)`` with the same NeuronCore start index for the additional models.
.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '2'

   # Load models (MXNet)
   # loaded into the first group of NC0-NC1
   sym, args, aux = mx.model.load_checkpoint(mx_model0_file, 0)
   model0 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')
   # loaded into the first group of NC0-NC1
   sym, args, aux = mx.model.load_checkpoint(mx_model1_file, 0)
   model1 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')

   # run inference by simply calling the loaded model
   results0 = model0.forward(data=inputs0)
   results1 = model1.forward(data=inputs1)

The total ``NEURON_RT_NUM_CORES`` across all processes cannot exceed the number of NeuronCores available on the instance. For example, on an inf1.xlarge with default configurations, where the total number of NeuronCores visible to TensorFlow-Neuron is 4, you can launch one process with ``NEURON_RT_NUM_CORES=2`` (pipelined) and another process with ``NEURON_RT_NUM_CORES=2`` (data-parallel).

Examples using ``NEURON_RT_NUM_CORES`` include:

* :ref:`PyTorch example `
* :ref:`MXNet example `

Auto Model Replication in TensorFlow Neuron (``tensorflow-neuron``) (Beta)
----------------------------------------------------------------------------------

Refer to the following API documentation to see how to perform automatic replication on multiple cores. Note that auto-replication only works on models compiled with pipeline size 1, via ``--neuroncore-pipeline-cores=1``. If automatic replication is not enabled, the model will default to replicating on up to 4 cores.

* Python API (TF 2.x only): :ref:`tensorflow-ref-auto-replication-python-api`
* CLI API (TF 1.x and TF 2.x): :ref:`tensorflow-ref-auto-replication-cli-api`

Auto Model Replication (Being Deprecated)
-----------------------------------------

The Auto Model Replication feature in TensorFlow-Neuron enables you to load the model once; data-parallel replication then occurs automatically. This reduces framework memory usage, as the same model is not loaded multiple times. This feature is beta and available in TensorFlow-Neuron only.

To enable Auto Model Replication, set NEURONCORE_GROUP_SIZES to Nx1, where N is the desired replication count (the number of NeuronCore groups, each of size 1). For example, NEURONCORE_GROUP_SIZES=8x1 would automatically replicate the single-NeuronCore model 8 times.

.. code :: python

   os.environ['NEURONCORE_GROUP_SIZES'] = '4x1'

or

.. code :: bash

   NEURONCORE_GROUP_SIZES=4x1 python3 application.py

When NEURONCORE_GROUP_SIZES is not set, the default is 4x1, where a single-NeuronCore model is replicated 4 times on any size of Inf1 machine.

This feature is only available for models compiled with neuroncore-pipeline-cores set to 1 (the default). You will still need to use threads in the scaffolding code to feed the loaded, replicated model instance in order to achieve high throughput.

Example of auto model replication: :ref:`/src/examples/tensorflow/openpose_demo/openpose.ipynb`

FAQ
---

Can I mix data parallel and NeuronCore Pipelines?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes. You can compile the model using the neuroncore-pipeline-cores option. This tells the compiler to target the specified number of cores for :ref:`neuroncore-pipeline`. The Neuron Compiler returns a NEFF that fits within this limit. See the :ref:`neuron-compiler-cli-reference` for instructions on how to use this option. For example, on an inf1.2xlarge, you can load two model instances, each compiled with neuroncore-pipeline-cores set to 2, so that they run in parallel. The model instances can be loaded from different saved models or from the same saved model.
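A minimal hedged sketch of that combination in TensorFlow Neuron (paths and the input tensor are hypothetical; it reuses the ``tfn.saved_model.compile`` API shown in the performance tuning guide later in this document):

.. code :: python

   import numpy as np
   import tensorflow.neuron as tfn

   # Compile the model for a 2-NeuronCore pipeline (hypothetical paths)
   example_input = np.zeros([1, 224, 224, 3], dtype='float16')
   tfn.saved_model.compile("rn50_fp16", "rn50_fp16_pipe2/1",
                           model_feed_dict={'input_1:0': example_input},
                           compiler_args=['--neuroncore-pipeline-cores', '2'])

Loading this compiled model twice, in a process launched with ``NEURON_RT_NUM_CORES=4``, would then place the two pipelined instances on separate pairs of NeuronCores.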
Can I mix multiple models in one NeuronCore group with a single model in another NeuronCore group?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently, you can do this in MXNet by setting up two NeuronCore groups, then loading, for example, multiple models in one NCG using the context mx.neuron(0), and a single model in the second NCG using the context mx.neuron(2). You can also load a single model in the first NCG and multiple models in the second NCG. For example:

.. code :: python

   # Set Environment
   os.environ['NEURON_RT_NUM_CORES'] = '6'

   # Load models (MXNet)
   # loaded into the first group of NC0-NC1
   sym, args, aux = mx.model.load_checkpoint(mx_model0_file, 0)
   model0 = sym.bind(ctx=mx.neuron(0), args=args, aux_states=aux, grad_req='null')
   # loaded into the second group of NC2-NC5
   sym, args, aux = mx.model.load_checkpoint(mx_model1_file, 0)
   model1 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')
   # loaded into the second group of NC2-NC5
   sym, args, aux = mx.model.load_checkpoint(mx_model2_file, 0)
   model2 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')
   # loaded into the second group of NC2-NC5
   sym, args, aux = mx.model.load_checkpoint(mx_model3_file, 0)
   model3 = sym.bind(ctx=mx.neuron(2), args=args, aux_states=aux, grad_req='null')

   # run inference by simply calling the loaded model
   results0 = model0.forward(data=inputs0)
   results1 = model1.forward(data=inputs1)
   results2 = model2.forward(data=inputs2)
   results3 = model3.forward(data=inputs3)

Loading multiple models in one NCG and a single model in another NCG is currently not supported in TensorFlow and PyTorch.

================================================
FILE: about-neuron/appnotes/perf/neuron-cc/performance-tuning.rst
================================================

.. _appnote-performance-tuning:

Performance Tuning
==================

.. important ::

   NeuronCore Groups (NCG) have been deprecated. See :ref:`eol-ncg` and :ref:`neuron-migrating-apps-neuron-to-libnrt` for more details.

This guide is intended to provide the reader with an in-depth understanding of how to optimize neural network performance on Inferentia for both throughput and latency. For simplicity, the guide uses TensorFlow and a ResNet-50 model as a teaching example to show how to choose between different compile-time optimizations (e.g., Batching and NeuronCore Pipeline), as well as model-serving optimizations (e.g., multi-threading and dynamic-batching) to improve inference performance.

The following guides are considered to be prerequisites for this tutorial:

- :ref:`/src/examples/tensorflow/tensorflow_resnet50/resnet50.ipynb`
- TensorFlow Serving NeuronCore Group
- :ref:`neuron-batching`
- :ref:`neuroncore-pipeline`

Batching and pipelining (technical background)
----------------------------------------------

Neuron provides developers with various performance optimization features. Two of the most widely used features are batching and pipelining. Both techniques aim to keep the data close to the compute engines, but they achieve this data locality in different ways.
In batching, it is achieved by loading the model data into an on-chip cache and reusing it multiple times for multiple different model inputs, while in pipelining it is achieved by caching all model parameters in the on-chip cache across multiple NeuronCores and streaming the calculation across them.

As a general rule of thumb, batching is preferred for applications that aim to optimize throughput and cost at the expense of latency, while pipelining is preferred for applications with a high-throughput requirement under a strict latency budget.

Compiling for batching optimization
-----------------------------------

To enable batching optimization, the model must first be compiled for a target batch size. This is done by specifying the batch size in the input tensor's batch dimension during compilation. Users are encouraged to evaluate multiple batch sizes in order to determine the optimal latency/throughput deployment point, which is application-dependent.

For example, the code snippet below enables batching on a ResNet-50 model, with a batch size of 5:

.. code:: python

   import numpy as np
   import tensorflow.neuron as tfn

   # To change the batch size, change the first dimension in example_input
   batch_size = 5
   example_input = np.zeros([batch_size,224,224,3], dtype='float16')

   tfn.saved_model.compile("rn50_fp16", "rn50_fp16_compiled/1",
                           model_feed_dict={'input_1:0': example_input},
                           dynamic_batch_size=True)

.. note::

   Depending on the size of the neural network, Neuron has a maximum batch size that works optimally on Inferentia. If an unsupported batch size is used, an internal compiler error message will be displayed. A simple way to explore the optimal batch size for your specific model is to increment the batch size from 1 upward, one at a time, and test application performance.

Compiling for pipeline optimization
-----------------------------------

In NeuronCore Pipeline mode, Neuron stores the model parameters in the Inferentia chips' local caches and streams inference requests across the available NeuronCores, as specified by the ``--neuroncore-pipeline-cores`` compiler argument. For example, to compile the model to fit the pipeline size of four Inferentia devices (16 NeuronCores) available in the inf1.6xlarge instance size:

.. code:: python

   import numpy as np
   import tensorflow.neuron as tfn

   compiler_args = ['--neuroncore-pipeline-cores', '16']
   example_input = np.zeros([1,224,224,3], dtype='float16')

   tfn.saved_model.compile("rn50_fp16", "rn50_fp16_compiled/1",
                           model_feed_dict={'input_1:0': example_input},
                           compiler_args=compiler_args)

The minimum number of NeuronCores needed to run a compiled model can be found using the Neuron Check Model tool. See :ref:`neuron_check_model`.

Model-serving inference optimizations
-------------------------------------

To fully realize the maximum throughput of the compiled model (for either batching or pipelining), you need to launch multiple host CPU threads to feed inputs into the Neuron pipeline. The number of threads needs to be larger than the specified maximum number of NeuronCores.

Additionally, dynamic batching can be used to process a larger client-side inference batch size; the framework automatically breaks up the user batch into smaller batch sizes to match the compiled batch size. This technique increases the achievable throughput by hiding the framework-to-Neuron overhead and amortizing it over a larger batch size. To use dynamic batching, set the argument ``dynamic_batch_size=True`` during compilation and send an inference batch size (the user inference batch size) that is a multiple of the compiled batch size.
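As an illustration of the multi-threaded feeding pattern described above, a hedged sketch (``model_neuron`` and ``batches`` are hypothetical stand-ins for your compiled model and prepared input batches):

.. code:: python

   from concurrent.futures import ThreadPoolExecutor

   # Using more worker threads than NeuronCores keeps the hardware pipeline
   # fed while the host performs pre- and post-processing.
   NUM_THREADS = 8

   with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
       results = list(pool.map(model_neuron, batches))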
Both methods can be applied together if this improves performance. However, multi-threading is always needed as a first step to achieve high throughput. You need to experiment to find the optimal settings for your application.

By default, the framework sets the number of outstanding inference requests to the total number of NeuronCores plus three. This can be changed by setting the NEURON_MAX_NUM_INFERS environment variable. For example, if the compiled model includes CPU partitions (e.g., if the Neuron compiler decides that some operations are more efficient to execute on CPU), the number of threads needs to be increased to account for the additional compute performed on the CPU. Note that the available instance host memory size needs to be taken into consideration to prevent out-of-memory errors. As above, you need to experiment in order to find the optimal settings for your application.

.. note::

   By default, the framework allocates a NeuronCore Group size to match the size of the compiled model. The size of the model is the NeuronCore limit passed to the compiler during compilation (the ``--neuroncore-pipeline-cores`` option). For more information, see the TensorFlow Serving NeuronCore Group documentation.

Other considerations
--------------------

Mixed Precision
~~~~~~~~~~~~~~~

You can find more information about performance and accuracy trade-offs in :ref:`neuron-cc-training-mixed-precision`.

Operator support
~~~~~~~~~~~~~~~~

The Neuron Compiler maintains an evolving list of supported operators for each framework: :ref:`neuron-supported-operators`

AWS Neuron handles unsupported operators by partitioning the graph into subgraphs and executing them on different targets (e.g., NeuronCore partition, CPU partition). If the entire model can run on Inferentia (i.e., all operators are supported), then it will be compiled into a single subgraph, which will be executed by a NeuronCore Group.

Debug
~~~~~

You can examine the post-compiled model to view the compilation results using the Neuron plugin for TensorBoard. See :ref:`tensorboard-plugin-visualize-graph`.

ResNet-50 optimization example
------------------------------

For an example demonstrating the concepts described here, see :ref:`/src/examples/tensorflow/keras_resnet50/keras_resnet50.ipynb`

================================================
FILE: about-neuron/appnotes/torch-neuron/bucketing-app-note.rst
================================================

.. _bucketing_app_note:

Running inference on variable input shapes with bucketing
=========================================================

.. contents:: Table of contents
   :local:
   :depth: 2

Introduction
------------

With Inferentia, the shape of every input must be fixed at compile time. For applications that require multiple input sizes, we recommend using padding or bucketing techniques. Padding requires you to compile your model with the largest expected input size and pad every input to this maximum size. If the performance of your model using padding is not within your targets, you can consider implementing bucketing.

This guide introduces bucketing, a technique to run inference on inputs with variable shapes on Inferentia. The following sections explain how bucketing can improve the performance of inference workloads on Inferentia. They cover an overview of how bucketing works and provide examples of using bucketing in :ref:`computer vision ` and :ref:`natural language processing` applications.
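For reference, the padding approach on its own looks like the following hedged sketch (shapes are hypothetical; the model would be compiled once for the maximum ``[800, 800]`` input):

.. code:: python

   import torch

   MAX_H, MAX_W = 800, 800  # largest expected input size

   def pad_to_max(image):
       # Pad the bottom and right of the image up to the maximum size
       h, w = image.shape[-2:]
       return torch.nn.functional.pad(image, (0, MAX_W - w, 0, MAX_H - h), value=0)

   padded = pad_to_max(torch.rand(1, 3, 640, 480))  # -> shape [1, 3, 800, 800]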
Applications that benefit from bucketing
----------------------------------------

Bucketing refers to compiling your model multiple times with different target input shapes to create "bucketed models". :ref:`creating_buckets` provides an overview of selecting the input shapes that you use to create bucketed models. At inference time, each input is padded until its shape matches the next largest bucket shape. The padded input is then passed into the corresponding bucketed model for inference. By compiling the same model with multiple different input shapes, the amount of input padding is reduced compared to padding every input to the maximum size in your dataset. This minimizes the compute overhead and improves inference performance compared to padding every image to the maximum shape in your dataset.

Bucketing works best when multiple different bucketed models are created to efficiently cover the full range of input shapes. You can fine-tune the model performance by experimenting with different bucket sizes that correspond to the distribution of input shapes in your dataset.

Bucketing can only be used if there is an upper bound on the shape of the inputs. If necessary, an upper bound on the input shape can be enforced using resizing and other forms of preprocessing.

.. _num_buckets:

The upper bound on the number of bucketed models that you use is dictated by the total size of the compiled bucketed models. Each Inferentia chip has 8GB of DRAM, or 2GB of DRAM per NeuronCore. An inf1.xlarge and inf1.2xlarge have 1 Inferentia chip, an inf1.6xlarge has 4 Inferentia chips, and an inf1.24xlarge has 16 Inferentia chips. Thus, you should limit the total size of all bucketed models to around 8GB per Inferentia chip or 2GB per NeuronCore.

The following formula provides an approximation for the number of compiled bucketed models you can fit on each NeuronCore:

::

   number-of-buckets = round(10^9 / number-of-weights-in-model)

For example, a model with roughly 100 million weights supports on the order of ten bucketed models per NeuronCore.

We recommend using :ref:`neuron-top ` to monitor the memory usage on your inf1 instance as you load multiple bucketed models.

Implementing bucketing
-----------------------

Implementing bucketing consists of two main parts: creating multiple bucketed models at compile time and running inference using the bucketed models on (padded) inputs. The following sections describe how to implement bucketing to run inference in applications that have variable input shapes.

.. _creating_buckets:

Creating bucketed models
^^^^^^^^^^^^^^^^^^^^^^^^^

Before running inference, models should be compiled for different input shapes that are representative of the input dataset. The input shapes that are used to compile the models determine the bucket shapes that are used during inference. The bucket shapes should be chosen to minimize the amount of padding on each new input. Additionally, there should always be a bucket that's large enough to handle the maximum input shape in the dataset. The limit on the number of compiled bucketed models that can be used is described in this :ref:`section`.

Running inference with bucketing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At inference time, each input should be padded to match the size of the next largest bucket, such that the height and width (or sequence length) of the padded input equals the size of the bucket.
Then, the padded input should be passed into the corresponding bucket for inference. If necessary, it's important to remove and/or crop any aberrant predictions that occur in the padded region. For example, in object detection applications, bounding box predictions that occur in the padded regions should be removed to avoid erroneous predictions.

.. _bucketing_examples:

Examples
--------

The following sections provide examples of applying the bucketing technique to run inference in applications that have variable input shapes.

.. _bucketing_example_cv:

Computer vision bucketing
^^^^^^^^^^^^^^^^^^^^^^^^^^

As an example of implementing bucketing for computer vision models, consider an application where the height and width of images in the dataset are uniformly distributed between `[400, 400]` and `[800, 800]`. Given that every input shape between `[400, 400]` and `[800, 800]` is equally likely, it could make sense to create bucketed models that divide the range of input shapes into equally sized chunks. For example, we could create bucketed models for the input shapes `[500, 500]`, `[600, 600]`, `[700, 700]`, and `[800, 800]`.

As an example of running inference with bucketing, let's assume that we created bucketed models for the input shapes `[500, 500]`, `[600, 600]`, `[700, 700]`, and `[800, 800]`. If we receive an input with shape `[640, 640]`, we would pad the input to the next largest bucket, `[700, 700]`, and use this bucket for inference. If we receive an input with shape `[440, 540]`, we would pad the input to the bucket size `[600, 600]`, and use this bucket for inference.

As another example of creating bucketed models, consider a computer vision application where the dataset is not uniformly distributed. As before, let's assume the input shapes range from `[400, 400]` to `[800, 800]`. Now, let's assume the data shape distribution is bimodal, such that `[540, 540]` and `[720, 720]` are the two most common input shapes. In this example, it might make sense to create bucketed models for the input shapes `[540, 540]`, `[720, 720]`, and `[800, 800]` to target the most common shapes while still covering the entire range of input shapes.

End-to-end computer vision bucketing example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, we run inference in a computer vision application that has variable shaped images ranging from `[400, 400]` to `[800, 800]`. We create bucketed models for the input shapes `[500, 500]`, `[600, 600]`, `[700, 700]`, and `[800, 800]` to handle the variable input shapes.
.. code-block:: python

   import numpy as np
   import torch
   from torchvision import models
   import torch_neuron

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Define the bucket sizes that will be used for compilation and inference
   bucket_sizes = [(500, 500), (600, 600), (700, 700), (800, 800)]

   # Create the bucketed models by compiling a model for each bucket size
   buckets = {}
   for bucket_size in bucket_sizes:
       # Create an example input that is the desired bucket size
       h, w = bucket_size
       image = torch.rand([1, 3, h, w])

       # Compile with the example input to create the bucketed model
       model_neuron = torch.neuron.trace(model, image)

       # Run a warm up inference to load the model into Inferentia memory
       model_neuron(image)

       # Add the bucketed model based on its bucket size
       buckets[bucket_size] = model_neuron

   def get_bucket_and_pad_image(image):
       # Determine which bucket size to use
       oh, ow = image.shape[-2:]
       target_bucket = None
       for bucket_size in bucket_sizes:
           # Choose a bucket that's larger in both the height and width dimensions
           if oh <= bucket_size[0] and ow <= bucket_size[1]:
               target_bucket = bucket_size
               break

       # Pad the image to match the size of the bucket
       h_delta = target_bucket[0] - oh
       w_delta = target_bucket[1] - ow
       b_pad = h_delta  # Bottom padding
       l_pad = 0        # Left padding
       t_pad = 0        # Top padding
       r_pad = w_delta  # Right padding

       # Pad the height and width of the image
       padding_amounts = (l_pad, r_pad, t_pad, b_pad)
       image_padded = torch.nn.functional.pad(image, padding_amounts, value=0)
       return image_padded, target_bucket

   # Run inference on inputs with different shapes
   for _ in range(10):
       # Create an image with a random height and width in range [400, 400] to [800, 800]
       h = int(np.random.uniform(low=400, high=800))
       w = int(np.random.uniform(low=400, high=800))
       image = torch.rand(1, 3, h, w)

       # Determine bucket and pad the image
       image_padded, target_bucket = get_bucket_and_pad_image(image)

       # Use the corresponding bucket to run inference
       output = buckets[target_bucket](image_padded)

.. _bucketing_example_nlp:

Natural language processing bucketing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As an example of implementing bucketing for natural language processing models, consider an application where the lengths of tokenized sequences in a dataset are uniformly distributed between 0 and 128 tokens. Given that every tokenized sequence length between 0 and 128 is equally likely, it might make sense to create bucketed models that divide the range of tokenized sequence lengths into equally sized chunks. For example, we could create bucketed models for tokenized sequence lengths 64 and 128.

As an example of running inference with bucketing, let's assume that we created bucketed models for the input tokenized sequence lengths 64 and 128. If we receive a tokenized sequence with length 55, we would pad it to the bucket size 64 and use this bucket for inference. If we receive a tokenized sequence with length 112, we would pad it to the bucket size 128 and use this bucket for inference.

End-to-end natural language processing bucketing example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, we run inference in a natural language processing application that has variable length tokenized sequences ranging from 0 to 128. We create bucketed models for lengths 64 and 128 to handle the variable input lengths.
.. code-block:: python

   import numpy as np
   import torch
   from transformers import AutoTokenizer, AutoModelForSequenceClassification
   import torch_neuron

   # Build tokenizer and model
   tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
   model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False)
   model.eval()

   # Define the bucket sizes that will be used for compilation and inference
   bucket_sizes = [64, 128]

   # Create the bucketed models by compiling a model for each bucket size
   buckets = {}
   for bucket_size in bucket_sizes:
       # Setup some example inputs
       sequence_0 = "The company HuggingFace is based in New York City"
       sequence_1 = "HuggingFace's headquarters are situated in Manhattan"

       # Create an example input that is the desired bucket size
       paraphrase = tokenizer.encode_plus(sequence_0,
                                          sequence_1,
                                          max_length=bucket_size,
                                          padding='max_length',
                                          truncation=True,
                                          return_tensors="pt")

       # Convert example inputs to a format that is compatible with TorchScript tracing
       example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']

       # Compile with the example input to create the bucketed model
       model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)

       # Run a warm up inference to load the model into Inferentia memory
       model_neuron(*example_inputs_paraphrase)

       # Add the bucketed model based on its bucket size
       buckets[bucket_size] = model_neuron

   def get_bucket_and_pad_paraphrase(paraphrase):
       # Determine which bucket size to use
       inputs = paraphrase['input_ids']
       attention = paraphrase['attention_mask']
       token_type = paraphrase['token_type_ids']
       paraphrase_len = inputs.shape[1]
       target_bucket = None
       for bucket_size in bucket_sizes:
           if paraphrase_len <= bucket_size:
               target_bucket = bucket_size
               break

       # Pad the paraphrase to match the size of the bucket
       delta = target_bucket - paraphrase_len
       zeros = torch.zeros([1, delta], dtype=torch.long)
       inputs = torch.cat([inputs, zeros], dim=1)
       attention = torch.cat([attention, zeros], dim=1)
       token_type = torch.cat([token_type, zeros], dim=1)
       paraphrase_padded = inputs, attention, token_type
       return paraphrase_padded, target_bucket

   # Create two sample sequences
   sequence_0 = ("The only other bear similar in size to the polar bear is the "
                 "Kodiak bear, which is a subspecies of the brown bear. Adult male "
                 "polar bears weigh 350–700 kg and measure 2.4–3 meters in total "
                 "length. All bears are short-tailed, the polar bear's tail is "
                 "relatively the shortest amongst living bears.")
   sequence_1 = ("Around the Beaufort Sea, however, mature males reportedly "
                 "average 450 kg. Adult females are roughly half the size of males "
                 "and normally weigh 150–250 kg, measuring 1.8–2.4 meters in length. "
                 "The legs are stocky and the ears and tail are small.")
   # Run inference on inputs with different shapes
   # We create the variable shapes by randomly cropping the sequences
   for _ in range(10):
       # Get random sequence lengths between 0 and 128
       paraphrase_len = int(np.random.uniform(0, 128))

       # Crop the paraphrase
       paraphrase_cropped = tokenizer.encode_plus(sequence_0,
                                                  sequence_1,
                                                  max_length=paraphrase_len,
                                                  padding='max_length',
                                                  truncation=True,
                                                  return_tensors="pt")

       # Determine bucket and pad the paraphrase
       paraphrase_padded, target_bucket = get_bucket_and_pad_paraphrase(paraphrase_cropped)

       # Use the corresponding bucket to run inference
       output = buckets[target_bucket](*paraphrase_padded)

================================================
FILE: about-neuron/appnotes/torch-neuron/index.rst
================================================

.. _torch-neuron-appnotes:

PyTorch Neuron Application Notes
=================================

.. toctree::
   :maxdepth: 1
   :hidden:

   bucketing-app-note
   rcnn-app-note
   torch-neuron-dataparallel-app-note

This section contains application notes specific to PyTorch Neuron (``torch-neuron``) for ``Inf1`` instances. These guides cover advanced optimization techniques, implementation patterns, and best practices for deploying PyTorch models on AWS Inferentia.

Application Notes
-----------------

.. grid:: 1 1 2 2
   :gutter: 2

   .. grid-item-card::
      :link: bucketing-app-note
      :link-type: doc

      **Dynamic Batching with Bucketing**
      ^^^
      Optimize inference performance using dynamic batching and bucketing strategies

   .. grid-item-card::
      :link: rcnn-app-note
      :link-type: doc

      **R-CNN Implementation Guide**
      ^^^
      Comprehensive guide for implementing and optimizing R-CNN models on Inferentia

   .. grid-item-card::
      :link: torch-neuron-dataparallel-app-note
      :link-type: doc

      **Data Parallel Inference**
      ^^^
      Scale inference workloads using ``torch.neuron.DataParallel`` for multi-core execution

================================================
FILE: about-neuron/appnotes/torch-neuron/rcnn-app-note.rst
================================================

.. _torch-neuron-r-cnn-app-note:

Running R-CNNs on Inf1
======================

This application note demonstrates how to compile and run `Detectron2 `__-based R-CNNs on Inf1. It also provides guidance on how to use profiling to improve the performance of R-CNN models on Inf1.

.. contents:: Table of contents
   :local:

R-CNN Model Overview
--------------------

Region-based CNN (R-CNN) models are commonly used for object detection and image segmentation tasks. A typical R-CNN architecture consists of the following components:

- **Backbone:** The backbone extracts features from input images. In some models, the backbone is a Feature Pyramid Network (FPN), which uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. The backbone is commonly a ResNet- or Vision Transformer-based network.
- **Region Proposal Network (RPN):** The RPN predicts region proposals with a wide range of scales and aspect ratios. RPNs are constructed using convolutional layers and anchor boxes that serve as references for multiple scales and aspect ratios.
- **Region of Interest (RoI):** The RoI component resizes the extracted features of varying size to the same size so that they can be consumed by a fully connected layer. RoI Align is typically used instead of RoI Pooling, because RoI Align provides better alignment.
The `Detectron2 `__ library provides many popular PyTorch R-CNN implementations, including R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. This application note focuses on the Detectron2 R-CNN models.

R-CNN Limitations and Considerations on Inferentia (NeuronCore-v1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

R-CNN models may have limitations and considerations on Inferentia (NeuronCore-v1). See the Model Architecture Fit Guidelines for more information. These limitations are not applicable to NeuronCore-v2.

Requirements
------------

The process described in this application note is intended to be run on an ``inf1.2xlarge``. In practice, R-CNN models can be run on any Inf1 instance size.

Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the `PyTorch Installation Guide `__. Select the kernel from the "Kernel -> Change Kernel" option at the top of the Jupyter notebook page.

Installation
------------

This process requires the following pip packages:

- ``torch==1.11.0``
- ``torch-neuron``
- ``neuron-cc``
- ``opencv-python``
- ``pycocotools``
- ``torchvision==0.12.0``
- ``detectron2==0.6``

The following section explains how to build ``torchvision`` from source and install the ``Detectron2`` package. It also reinstalls the Neuron packages, to ensure version compatibility. The ``torchvision`` ``roi_align_kernel.cpp`` kernel is modified to use OMP threading for multi-threaded inference on the CPU. This significantly improves the performance of RoI Align kernels on Inf1: compared with the default ``roi_align_kernel.cpp`` kernel configuration, OMP threading reduces RoI Align latency by a factor of two to three.

.. code:: ipython3

    # Install python3.7-dev for pycocotools (a Detectron2 dependency)
    !sudo apt install python3.7-dev -y

    # Install Neuron packages
    !pip uninstall -y torchvision
    !pip install --force-reinstall "protobuf==3.20.1" ninja opencv-python
    !pip install --force-reinstall torch-neuron==1.11.0.* neuron-cc[tensorflow] --extra-index-url https://pip.repos.neuron.amazonaws.com

    # Change cuda to 10.2 for Detectron2
    !sudo rm /usr/local/cuda
    !sudo ln -s /usr/local/cuda-10.2 /usr/local/cuda

    # Install Torchvision 0.12.0 from source
    !git clone -b release/0.12 https://github.com/pytorch/vision.git

    # Update the RoI Align kernel to use OMP multithreading
    with open('vision/torchvision/csrc/ops/cpu/roi_align_kernel.cpp', 'r') as file:
        content = file.read()

    # Enable OMP Multithreading and set the number of threads to 4
    old = "// #pragma omp parallel for num_threads(32)"
    new = "#pragma omp parallel for num_threads(4)"
    content = content.replace(old, new)

    # Re-write the file
    with open('vision/torchvision/csrc/ops/cpu/roi_align_kernel.cpp', 'w') as file:
        file.write(content)

    # Build Torchvision with OMP threading
    !cd vision && CFLAGS="-fopenmp" python setup.py bdist_wheel
    %pip install vision/dist/*.whl

    # Install Detectron2 release v0.6
    !python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.6'

Compiling an R-CNN for Inf1
---------------------------

By default, R-CNN models are not compilable on Inf1, because they cannot be traced with ``torch.jit.trace``, which is a prerequisite for inference on Inf1. The following section demonstrates techniques for compiling a Detectron2 R-CNN model for inference on Inf1. Specifically, this section explains how to create a standard Detectron2 R-CNN model, using a ResNet-101 backbone.
It demonstrates how to use profiling to identify the most compute-intensive parts of the R-CNN that need to be compiled for accelerated inference on Inf1. It then explains how to manually extract and compile the ResNet backbone (the dominant compute component) and inject the compiled backbone back into the full model for improved performance.

Create a Detectron2 R-CNN Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a Detectron2 R-CNN model using the ``COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml`` pretrained weights and config file. Download a sample image from the COCO dataset and run an example inference.

.. code:: ipython3

    from detectron2 import model_zoo
    from detectron2.engine import DefaultPredictor
    from detectron2.config import get_cfg

    def get_model():
        # Configure the R-CNN model
        CONFIG_FILE = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"
        WEIGHTS_FILE = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"
        cfg = get_cfg()
        cfg.merge_from_file(model_zoo.get_config_file(CONFIG_FILE))
        cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(WEIGHTS_FILE)
        cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
        cfg.MODEL.DEVICE = 'cpu'  # Send to CPU for Neuron Tracing

        # Create the R-CNN predictor wrapper
        predictor = DefaultPredictor(cfg)
        return predictor

.. code:: ipython3

    import os
    import urllib.request

    # Define a function to get a sample image
    def get_image():
        filename = 'input.jpg'
        if not os.path.exists(filename):
            url = "http://images.cocodataset.org/val2017/000000439715.jpg"
            urllib.request.urlretrieve(url, filename)
        return filename

.. code:: ipython3

    import time
    import cv2

    # Create an R-CNN model
    predictor = get_model()

    # Get a sample image from the COCO dataset
    image_filename = get_image()
    image = cv2.imread(image_filename)

    # Run inference and print inference latency
    start = time.time()
    outputs = predictor(image)
    print(f'Inference time: {(time.time() - start):0.3f} s')

Profile the Model
~~~~~~~~~~~~~~~~~

Use the `PyTorch Profiler `__ to identify which operators contribute the most to the model's runtime on CPU. Ideally, you can compile these compute-intensive operators onto Inf1 for accelerated inference.

.. code:: ipython3

    import torch.autograd.profiler as profiler

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            predictor(image)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))

We see that convolution operators (``aten::convolution``) contribute the most to inference time. By compiling these convolution operators to Inf1, you can improve the performance of the R-CNN model.

Print the R-CNN model architecture to see which layers contain the ``aten::convolution`` operators:

.. code:: ipython3

    print(predictor.model)

Note that the ResNet FPN backbone (`predictor.model.backbone `__ L17-L162) contains the majority of convolution operators in the model. The RPN (`predictor.model.proposal_generator `__ L181-L533) also contains several convolutions. Based on this, compile the ResNet backbone and RPN onto Inf1 to maximize performance.

Compiling the ResNet backbone to Inf1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section demonstrates how to compile the ResNet backbone to Inf1 and use it for inference. Extract the backbone by accessing it with ``predictor.model.backbone``. Compile the backbone using ``strict=False``, because the backbone outputs a dictionary. Use a fixed input shape (``800 x 800``) for compilation, as all inputs will be resized to this shape during inference.
This section also defines a basic preprocessing function (mostly derived from the Detectron2 R-CNN `DefaultPredictor `__ module L308-L318) that reshapes inputs to ``800 x 800``. Create a ``NeuronRCNN`` wrapper to inject the compiled backbone back into the model by dynamically replacing the ``predictor.model.backbone`` attribute with the compiled model.

.. code:: ipython3

    import torch
    import torch_neuron

    example = torch.rand([1, 3, 800, 800])

    # Use `with torch.no_grad():` to avoid a jit tracing issue in the ResNet backbone
    with torch.no_grad():
        neuron_backbone = torch_neuron.trace(predictor.model.backbone, example, strict=False)

    backbone_filename = 'backbone.pt'
    torch.jit.save(neuron_backbone, backbone_filename)

.. code:: ipython3

    from detectron2.modeling.meta_arch.rcnn import GeneralizedRCNN
    from torch.jit import ScriptModule

    class NeuronRCNN(torch.nn.Module):
        """
        Creates a `NeuronRCNN` wrapper that injects the compiled backbone
        into the R-CNN model. It also stores the `size_divisibility`
        attribute from the original backbone.
        """
        def __init__(self, model: GeneralizedRCNN, neuron_backbone: ScriptModule) -> None:
            super().__init__()

            # Keep track of the backbone variables
            size_divisibility = model.backbone.size_divisibility

            # Load and inject the compiled backbone
            model.backbone = neuron_backbone

            # Set backbone variables
            setattr(model.backbone, 'size_divisibility', size_divisibility)

            self.model = model

        def forward(self, x):
            return self.model(x)

.. code:: ipython3

    # Create the R-CNN with the compiled backbone
    neuron_rcnn = NeuronRCNN(predictor.model, neuron_backbone)
    neuron_rcnn.eval()

    # Print the R-CNN architecture to verify the backbone is now the
    # `neuron_backbone` (shows up as `RecursiveScriptModule`)
    print(neuron_rcnn)

.. code:: ipython3

    def preprocess(original_image, predictor):
        """
        A basic preprocessing function that sets the input height=800 and
        input width=800. The function is derived from the preprocessing
        steps in the Detectron2 `DefaultPredictor` module.
        """
        height, width = original_image.shape[:2]
        resize_func = predictor.aug.get_transform(original_image)
        resize_func.new_h = 800  # Override height
        resize_func.new_w = 800  # Override width
        image = resize_func.apply_image(original_image)
        image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
        inputs = {"image": image, "height": height, "width": width}
        return inputs

.. code:: ipython3

    # Get a resized input using the sample image
    inputs = preprocess(image, get_model())

    # Run inference and print inference latency
    start = time.time()
    for _ in range(10):
        outputs = neuron_rcnn([inputs])[0]
    print(f'Inference time: {((time.time() - start)/10):0.3f} s')

.. code:: ipython3

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            neuron_rcnn([inputs])
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))

By running the backbone on Inf1, the overall runtime is already significantly improved. The count and runtime of ``aten::convolution`` operators are also decreased. We now see a ``neuron::forward_v2`` operator, which is the compiled backbone.

Optimize the R-CNN model
------------------------

Compiling the RPN
~~~~~~~~~~~~~~~~~

Examine the profiling output and note that there are still several ``aten::convolution``, ``aten::linear``, and ``aten::addmm`` operators that significantly contribute to the model's overall latency.
By inspecting the model's architecture and code, we can determine that
the majority of these operators are contained in the RPN module
(`predictor.model.proposal_generator `__ L181-L533). To improve the
model's performance, extract the RPN Head and compile it on Inf1 to
increase the number of operators running on Inf1. Compile only the RPN
Head (rather than the entire RPN), because the RPN Anchor Generator
contains objects that are not traceable with ``torch.jit.trace``.

The RPN Head contains five layers that run inference on multiple
resized inputs. To compile the RPN Head, create a list of tensors that
contain the input ("``features``") shapes used by the RPN Head on each
layer. These tensor shapes can be determined by printing the input
shapes in the RPN Head ``forward`` function
(``predictor.model.proposal_generator.rpn_head.forward``).

Create a new ``NeuronRCNN`` wrapper that injects both the compiled
backbone and RPN Head into the R-CNN model.

.. code:: ipython3

    import math

    input_shape = [1, 3, 800, 800]  # Overall input shape at inference time

    # Create the example list of RPN inputs using the resizing logic from the RPN Head
    features = list()
    for i in [0, 1, 2, 3, 4]:
        ratio = 1 / (4 * 2**i)
        x_i_h = math.ceil(input_shape[2] * ratio)
        x_i_w = math.ceil(input_shape[3] * ratio)
        feature = torch.zeros(1, 256, x_i_h, x_i_w)
        features.append(feature)

.. code:: ipython3

    # Extract and compile the RPN Head
    neuron_rpn_head = torch_neuron.trace(predictor.model.proposal_generator.rpn_head, [features])

    rpn_head_filename = 'rpn_head.pt'
    torch.jit.save(neuron_rpn_head, rpn_head_filename)

.. code:: ipython3

    class NeuronRCNN(torch.nn.Module):
        """
        Creates a wrapper that injects the compiled backbone and RPN Head
        into the R-CNN model.
        """

        def __init__(self, model: GeneralizedRCNN, neuron_backbone: ScriptModule,
                     neuron_rpn_head: ScriptModule) -> None:
            super().__init__()

            # Keep track of the backbone variables
            size_divisibility = model.backbone.size_divisibility

            # Inject the compiled backbone
            model.backbone = neuron_backbone

            # Set backbone variables
            setattr(model.backbone, 'size_divisibility', size_divisibility)

            # Inject the compiled RPN Head
            model.proposal_generator.rpn_head = neuron_rpn_head

            self.model = model

        def forward(self, x):
            return self.model(x)

.. code:: ipython3

    # Create the R-CNN with the compiled backbone and RPN Head
    predictor = get_model()
    neuron_rcnn = NeuronRCNN(predictor.model, neuron_backbone, neuron_rpn_head)
    neuron_rcnn.eval()

    # Print the R-CNN architecture to verify the compiled modules show up
    print(neuron_rcnn)

.. code:: ipython3

    # Run inference and print inference latency
    start = time.time()
    for _ in range(10):
        outputs = neuron_rcnn([inputs])[0]
    print(f'Inference time: {((time.time() - start)/10):0.3f} s')

.. code:: ipython3

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            neuron_rcnn([inputs])
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))

By running the compiled backbone and RPN Head on Inf1, overall runtime
is improved. Once again, the number and runtime of
``aten::convolution`` operators are also decreased. There are now two
``neuron::forward_v2`` operators, which correspond to the compiled
backbone and RPN Head.

Fusing the Backbone and RPN Head
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually preferable to compile fewer independent models
("subgraphs") on Inf1.
Combining models and compiling them as a single subgraph enables the
Neuron compiler to perform additional optimizations and reduces I/O
data transfer between the CPU and NeuronCores at each subgraph
boundary.

In this section, the ResNet backbone and RPN Head are "fused" into a
single model to compile on Inf1. Create the
``NeuronFusedBackboneRPNHead`` wrapper as a compilable model that
contains both the ResNet backbone (`predictor.model.backbone `__
L17-L162) and RPN Head (`predictor.model.proposal_generator `__
L181-L533). Output the ``features`` to be used downstream by the RoI
Heads. Compile this ``NeuronFusedBackboneRPNHead`` wrapper as
``neuron_backbone_rpn_head``, then create a separate ``BackboneRPN``
wrapper to inject the ``neuron_backbone_rpn_head`` in place of the
original backbone and RPN Head. Copy the remainder of the RPN
``forward`` code (`predictor.model.proposal_generator.forward `__
L431-L480) to create a "fused" backbone + RPN module.

Lastly, re-write the ``NeuronRCNN`` wrapper to use the fused backbone +
RPN module. The ``NeuronRCNN`` wrapper also uses the
``predictor.model`` ``forward`` code to re-write the rest of the R-CNN
model forward function.

.. code:: ipython3

    class NeuronFusedBackboneRPNHead(torch.nn.Module):
        """
        Wrapper to compile the fused ResNet backbone and RPN Head.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()
            self.backbone = model.backbone
            self.rpn_head = model.proposal_generator.rpn_head
            self.in_features = model.proposal_generator.in_features

        def forward(self, x):
            features = self.backbone(x)
            features_ = [features[f] for f in self.in_features]
            return self.rpn_head(features_), features

.. code:: ipython3

    # Create the wrapper with the combined backbone and RPN Head
    predictor = get_model()
    backbone_rpn_wrapper = NeuronFusedBackboneRPNHead(predictor.model)
    backbone_rpn_wrapper.eval()

    # Compile the wrapper
    example = torch.rand([1, 3, 800, 800])
    with torch.no_grad():
        neuron_backbone_rpn_head = torch_neuron.trace(
            backbone_rpn_wrapper, example, strict=False)

    backbone_rpn_filename = 'backbone_rpn.pt'
    torch.jit.save(neuron_backbone_rpn_head, backbone_rpn_filename)

.. code:: ipython3

    class BackboneRPN(torch.nn.Module):
        """
        Wrapper that uses the compiled `neuron_backbone_rpn_head` instead of
        the original backbone and RPN Head. We copy the remainder of the RPN
        `forward` code (`predictor.model.proposal_generator.forward`) to
        create a "fused" backbone + RPN module.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()
            self.backbone_rpn_head = NeuronFusedBackboneRPNHead(model)
            self._rpn = model.proposal_generator
            self.in_features = model.proposal_generator.in_features

        def forward(self, images):
            preds, features = self.backbone_rpn_head(images.tensor)
            features_ = [features[f] for f in self.in_features]
            pred_objectness_logits, pred_anchor_deltas = preds
            anchors = self._rpn.anchor_generator(features_)
            # Transpose the Hi*Wi*A dimension to the middle:
            pred_objectness_logits = [
                # (N, A, Hi, Wi) -> (N, Hi, Wi, A) -> (N, Hi*Wi*A)
                score.permute(0, 2, 3, 1).flatten(1)
                for score in pred_objectness_logits
            ]
            pred_anchor_deltas = [
                # (N, A*B, Hi, Wi) -> (N, A, B, Hi, Wi) -> (N, Hi, Wi, A, B) -> (N, Hi*Wi*A, B)
                x.view(x.shape[0], -1, self._rpn.anchor_generator.box_dim, x.shape[-2], x.shape[-1])
                .permute(0, 3, 4, 1, 2)
                .flatten(1, -2)
                for x in pred_anchor_deltas
            ]
            proposals = self._rpn.predict_proposals(
                anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes
            )
            return proposals, features
.. code:: ipython3

    class NeuronRCNN(torch.nn.Module):
        """
        Wrapper that uses the fused backbone + RPN module and re-writes the
        rest of the R-CNN `model` `forward` function.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()

            # Use the fused Backbone + RPN
            self.backbone_rpn = BackboneRPN(model)

            self.roi_heads = model.roi_heads
            self.preprocess_image = model.preprocess_image
            self._postprocess = model._postprocess

        def forward(self, batched_inputs):
            images = self.preprocess_image(batched_inputs)
            proposals, features = self.backbone_rpn(images)
            results, _ = self.roi_heads(images, features, proposals, None)
            return self._postprocess(results, batched_inputs, images.image_sizes)

.. code:: ipython3

    # Create the new NeuronRCNN wrapper with the combined backbone and RPN Head
    predictor = get_model()
    neuron_rcnn = NeuronRCNN(predictor.model)
    neuron_rcnn.eval()

    # Inject the Neuron compiled models
    neuron_rcnn.backbone_rpn.backbone_rpn_head = neuron_backbone_rpn_head

    # Print the R-CNN architecture to verify the compiled modules show up
    print(neuron_rcnn)

.. code:: ipython3

    # Run inference and print inference latency
    start = time.time()
    for _ in range(10):
        outputs = neuron_rcnn([inputs])[0]
    print(f'Inference time: {((time.time() - start)/10):0.3f} s')

.. code:: ipython3

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            neuron_rcnn([inputs])
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))

By running the fused backbone + RPN Head on Inf1, overall runtime is
improved even more. We now see a single ``neuron::forward_v2`` operator
with a lower runtime than the previous combined runtime of the two
separate ``neuron::forward_v2`` operators.

Compiling the RoI Heads
~~~~~~~~~~~~~~~~~~~~~~~

This section describes how to extract and compile part of the RoI Heads
module (`predictor.model.roi_heads `__ L530-L778), which runs most of
the remaining ``aten::linear`` and ``aten::addmm`` operators on Inf1.
The entire RoI Heads module cannot be extracted, because it contains
unsupported operators. So you need to create a
``NeuronBoxHeadBoxPredictor`` wrapper that extracts specific parts of
the ``roi_heads`` for compilation. The example input for compilation is
the shape of the input into the ``self.roi_heads.box_head.forward``
function.

Write another wrapper, ``ROIHead``, that combines the compiled
``roi_heads`` into the rest of the RoI module. The ``_forward_box`` and
``forward`` functions are from the ``predictor.model.roi_heads``
module.

Lastly, re-write the ``NeuronRCNN`` wrapper to use the optimized RoI
Heads wrapper as well as the fused backbone + RPN module.

.. code:: ipython3

    class NeuronBoxHeadBoxPredictor(torch.nn.Module):
        """
        Wrapper that extracts the RoI Box Head and Box Predictor for
        compilation.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()
            self.roi_heads = model.roi_heads

        def forward(self, box_features):
            box_features = self.roi_heads.box_head(box_features)
            predictions = self.roi_heads.box_predictor(box_features)
            return predictions

.. code:: ipython3

    # Create the NeuronBoxHeadBoxPredictor wrapper
    predictor = get_model()
    box_head_predictor = NeuronBoxHeadBoxPredictor(predictor.model)
    box_head_predictor.eval()

    # Compile the wrapper
    example = torch.rand([1000, 256, 7, 7])
    neuron_box_head_predictor = torch_neuron.trace(box_head_predictor, example)

    roi_head_filename = 'box_head_predictor.pt'
    torch.jit.save(neuron_box_head_predictor, roi_head_filename)
.. code:: ipython3

    class ROIHead(torch.nn.Module):
        """
        Wrapper that combines the compiled `roi_heads` into the rest of the
        RoI module. The `_forward_box` and `forward` functions are from the
        `predictor.model.roi_heads` module.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()
            self.roi_heads = model.roi_heads
            self.neuron_box_head_predictor = NeuronBoxHeadBoxPredictor(model)

        def _forward_box(self, features, proposals):
            features = [features[f] for f in self.roi_heads.box_in_features]
            box_features = self.roi_heads.box_pooler(
                features, [x.proposal_boxes for x in proposals])
            predictions = self.neuron_box_head_predictor(box_features)
            pred_instances, _ = self.roi_heads.box_predictor.inference(
                predictions, proposals)
            return pred_instances

        def forward(self, images, features, proposals, targets=None):
            pred_instances = self._forward_box(features, proposals)
            pred_instances = self.roi_heads.forward_with_given_boxes(
                features, pred_instances)
            return pred_instances, {}

.. code:: ipython3

    class NeuronRCNN(torch.nn.Module):
        """
        Wrapper that uses the fused backbone + RPN module and the optimized
        RoI Heads wrapper.
        """

        def __init__(self, model: GeneralizedRCNN) -> None:
            super().__init__()

            # Create fused Backbone + RPN
            self.backbone_rpn = BackboneRPN(model)

            # Create Neuron RoI Head
            self.roi_heads = ROIHead(model)

            # Define pre and post-processing functions
            self.preprocess_image = model.preprocess_image
            self._postprocess = model._postprocess

        def forward(self, batched_inputs):
            images = self.preprocess_image(batched_inputs)
            proposals, features = self.backbone_rpn(images)
            results, _ = self.roi_heads(images, features, proposals, None)
            return self._postprocess(results, batched_inputs, images.image_sizes)

.. code:: ipython3

    # Initialize an R-CNN on CPU
    predictor = get_model()

    # Create the Neuron R-CNN on CPU
    neuron_rcnn = NeuronRCNN(predictor.model)
    neuron_rcnn.eval()

    # Inject the Neuron compiled models
    neuron_rcnn.backbone_rpn.backbone_rpn_head = neuron_backbone_rpn_head
    neuron_rcnn.roi_heads.neuron_box_head_predictor = neuron_box_head_predictor

.. code:: ipython3

    # Run inference and print inference latency
    start = time.time()
    for _ in range(10):
        outputs = neuron_rcnn([inputs])[0]
    print(f'CPU Inference time: {((time.time() - start)/10):0.3f} s')

.. code:: ipython3

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            neuron_rcnn([inputs])
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))

Although the overall latency did not change significantly, running more
of the model on Inf1 instead of CPU frees up CPU resources when
multiple models are running in parallel.

End-to-end Compilation and Inference
------------------------------------

This section provides standalone code that compiles and runs an
optimized Detectron2 R-CNN on Inf1. Most of the code in this section is
from the previous sections in this application note and is consolidated
here for easy deployment.

This section has the following main components:

- Preprocessing and compilation functions
- Wrappers that extract the R-CNN ResNet backbone, RPN Head, and RoI
  Head for compilation on Inf1
- A ``NeuronRCNN`` wrapper that creates an optimized end-to-end
  Detectron2 R-CNN model for inference on Inf1
- Benchmarking code that runs parallelized inference for optimized
  throughput on Inf1

Benchmarking
~~~~~~~~~~~~

The benchmarking section explains how to load multiple optimized RCNN
models and run them in parallel to maximize throughput.
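As a minimal sketch of the per-core loading pattern used in the
benchmark below (``backbone_rpn.pt`` is the compiled artifact saved
earlier; the placement API is a beta feature that is described next):

.. code:: python

    import torch
    import torch_neuron

    # Load a compiled TorchScript artifact onto a specific NeuronCore
    # (NeuronCore 0 here) using the beta placement context manager
    with torch_neuron.experimental.neuron_cores_context(0):
        neuron_module = torch.jit.load('backbone_rpn.pt')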
Use the beta NeuronCore placement API,
``torch_neuron.experimental.neuron_cores_context()``, to ensure all
compiled models in an optimized RCNN model are loaded onto the same
NeuronCore. Note that the functionality and API of
``torch_neuron.experimental.neuron_cores_context()`` might change in
future releases.

Define a simple benchmark function that loads a configurable number of
optimized RCNN models onto separate NeuronCores, runs multithreaded
inference, and calculates the corresponding latency and throughput.
Benchmark various numbers of loaded models to show the impact of
parallelism. Note that throughput increases (at the cost of latency)
when more models are run in parallel on Inf1. Increasing the number of
worker threads also improves throughput.

Other improvements
~~~~~~~~~~~~~~~~~~

There are many additional optimizations that can be applied to RCNN
models on Inf1 depending on the application:

For latency-sensitive applications:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Each of the five layers in the RPN Head can be parallelized to
  decrease overall latency.
- The number of OMP threads can be increased in the ROI Align kernel.

Both of these optimizations improve latency at the cost of decreasing
throughput.

For throughput-sensitive applications:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The input batch size can be increased to improve NeuronCore
  utilization.

.. code:: ipython3

    import time
    import os
    import urllib.request
    from typing import Any, Union, Callable

    import cv2
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    import torch
    import torch_neuron

    from detectron2 import model_zoo
    from detectron2.engine import DefaultPredictor
    from detectron2.config import get_cfg
    from detectron2.modeling.meta_arch.rcnn import GeneralizedRCNN


    # -----------------------------------------------------------------------------
    # Helper functions
    # -----------------------------------------------------------------------------

    def get_model():
        # Configure the R-CNN model
        CONFIG_FILE = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"
        WEIGHTS_FILE = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"
        cfg = get_cfg()
        cfg.merge_from_file(model_zoo.get_config_file(CONFIG_FILE))
        cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(WEIGHTS_FILE)
        cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
        cfg.MODEL.DEVICE = 'cpu'  # Send to CPU for Neuron Tracing

        # Create the R-CNN predictor wrapper
        predictor = DefaultPredictor(cfg)
        return predictor


    def get_image():
        # Get a sample image
        filename = 'input.jpg'
        if not os.path.exists(filename):
            url = "http://images.cocodataset.org/val2017/000000439715.jpg"
            urllib.request.urlretrieve(url, filename)
        return filename


    def preprocess(original_image, predictor):
        """
        A basic preprocessing function that sets the input height=800 and
        input width=800. The function is derived from the preprocessing
        steps in the Detectron2 `DefaultPredictor` module.
        """
        height, width = original_image.shape[:2]
        resize_func = predictor.aug.get_transform(original_image)
        resize_func.new_h = 800  # Override height
        resize_func.new_w = 800  # Override width
        image = resize_func.apply_image(original_image)
        image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
        inputs = {"image": image, "height": height, "width": width}
        return inputs


    # -----------------------------------------------------------------------------
    # Neuron modules
    # -----------------------------------------------------------------------------

    class NeuronFusedBackboneRPNHead(torch.nn.Module):
        """
        Wrapper to compile the fused ResNet backbone and RPN Head.
""" def __init__(self, model: GeneralizedRCNN) -> None: super().__init__() self.backbone = model.backbone self.rpn_head = model.proposal_generator.rpn_head self.in_features = model.proposal_generator.in_features def forward(self, x): features = self.backbone(x) features_ = [features[f] for f in self.in_features] return self.rpn_head(features_), features class BackboneRPN(torch.nn.Module): """ Wrapper that uses the compiled `neuron_backbone_rpn` instead of the original backbone and RPN Head. We copy the remainder of the RPN `forward` code (`predictor.model.proposal_generator.forward`) to create a "fused" backbone + RPN module. """ def __init__(self, model: GeneralizedRCNN) -> None: super().__init__() self.backbone_rpn_head = NeuronFusedBackboneRPNHead(model) self._rpn = model.proposal_generator self.in_features = model.proposal_generator.in_features def forward(self, images): preds, features = self.backbone_rpn_head(images.tensor) features_ = [features[f] for f in self.in_features] pred_objectness_logits, pred_anchor_deltas = preds anchors = self._rpn.anchor_generator(features_) # Transpose the Hi*Wi*A dimension to the middle: pred_objectness_logits = [ # (N, A, Hi, Wi) -> (N, Hi, Wi, A) -> (N, Hi*Wi*A) score.permute(0, 2, 3, 1).flatten(1) for score in pred_objectness_logits ] pred_anchor_deltas = [ # (N, A*B, Hi, Wi) -> (N, A, B, Hi, Wi) -> (N, Hi, Wi, A, B) -> (N, Hi*Wi*A, B) x.view(x.shape[0], -1, self._rpn.anchor_generator.box_dim, x.shape[-2], x.shape[-1]) .permute(0, 3, 4, 1, 2) .flatten(1, -2) for x in pred_anchor_deltas ] proposals = self._rpn.predict_proposals( anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes ) return proposals, features class NeuronBoxHeadBoxPredictor(torch.nn.Module): """ Wrapper that extracts the RoI Box Head and Box Predictor for compilation. """ def __init__(self, model: GeneralizedRCNN) -> None: super().__init__() self.roi_heads = model.roi_heads def forward(self, box_features): box_features = self.roi_heads.box_head(box_features) predictions = self.roi_heads.box_predictor(box_features) return predictions class ROIHead(torch.nn.Module): """ Wrapper that combines the compiled `roi_heads` into the rest of the RoI module. The `_forward_box` and `forward` functions are from the `predictor.model.roi_heads` module. 
""" def __init__(self, model: GeneralizedRCNN) -> None: super().__init__() self.roi_heads = model.roi_heads self.neuron_box_head_predictor = NeuronBoxHeadBoxPredictor(model) def _forward_box(self, features, proposals): features = [features[f] for f in self.roi_heads.box_in_features] box_features = self.roi_heads.box_pooler( features, [x.proposal_boxes for x in proposals]) predictions = self.neuron_box_head_predictor(box_features) pred_instances, _ = self.roi_heads.box_predictor.inference( predictions, proposals) return pred_instances def forward(self, images, features, proposals, targets=None): pred_instances = self._forward_box(features, proposals) pred_instances = self.roi_heads.forward_with_given_boxes( features, pred_instances) return pred_instances, {} class NeuronRCNN(torch.nn.Module): """ Wrapper that uses the fused backbone + RPN module and the optimized RoI Heads wrapper """ def __init__(self, model: GeneralizedRCNN) -> None: super().__init__() # Create fused Backbone + RPN self.backbone_rpn = BackboneRPN(model) # Create Neuron RoI Head self.roi_heads = ROIHead(model) # Define pre and post-processing functions self.preprocess_image = model.preprocess_image self._postprocess = model._postprocess def forward(self, batched_inputs): images = self.preprocess_image(batched_inputs) proposals, features = self.backbone_rpn(images) results, _ = self.roi_heads(images, features, proposals, None) return self._postprocess(results, batched_inputs, images.image_sizes) # ----------------------------------------------------------------------------- # Compilation functions # ----------------------------------------------------------------------------- def compile( model: Union[Callable, torch.nn.Module], example_inputs: Any, filename: str, **kwargs ) -> torch.nn.Module: """ Compiles the model for Inf1 if it doesn't already exist and saves it as the provided filename. model: A module or function which defines a torch model or computation. example_inputs: An example set of inputs which will be passed to the `model` during compilation. filename: Name of the compiled model kwargs: Extra `torch_neuron.trace` kwargs """ if not os.path.exists(filename): with torch.no_grad(): compiled_model = torch_neuron.trace(model, example_inputs, **kwargs) torch.jit.save(compiled_model, filename) # ----------------------------------------------------------------------------- # Benchmarking function # ----------------------------------------------------------------------------- def benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=4, batch_size=1, n_threads=4, iterations=200): """ A simple benchmarking function that loads `n_models` optimized models onto separate NeuronCores, runs multithreaded inference, and calculates the corresponding latency and throughput. 
""" # Load models models = list() for i in range(n_models): with torch_neuron.experimental.neuron_cores_context(i): # Create the RCNN with the fused backbone + RPN Head and compiled RoI Heads # Initialize an R-CNN on CPU predictor = get_model() # Create the Neuron R-CNN on CPU neuron_rcnn = NeuronRCNN(predictor.model) neuron_rcnn.eval() # Inject the Neuron compiled models neuron_rcnn.backbone_rpn.backbone_rpn_head = torch.jit.load(backbone_rpn_filename) neuron_rcnn.roi_heads.neuron_box_head_predictor = torch.jit.load(roi_head_filename) models.append(neuron_rcnn) # Warmup for _ in range(8): for model in models: model([inputs]) latencies = [] # Thread task def task(i): start = time.time() models[i]([inputs]) finish = time.time() latencies.append((finish - start) * 1000) begin = time.time() with ThreadPoolExecutor(max_workers=n_threads) as pool: for i in range(iterations): pool.submit(task, i % n_models) end = time.time() # Compute metrics boundaries = [50, 95, 99] names = [f'Latency P{i} (ms)' for i in boundaries] percentiles = np.percentile(latencies, boundaries) duration = end - begin # Display metrics results = { 'Samples': iterations, 'Batch Size': batch_size, 'Models': n_models, 'Threads': n_threads, 'Duration (s)': end - begin, 'Throughput (inf/s)': (batch_size * iterations) / duration, **dict(zip(names, percentiles)), } print('-' * 80) pad = max(map(len, results)) for key, value in results.items(): if isinstance(value, float): print(f'{key + ":" :<{pad + 1}} {value:0.3f}') else: print(f'{key + ":" :<{pad + 1}} {value}') print() if __name__ == "__main__": # Create and compile the combined backbone and RPN Head wrapper backbone_rpn_filename = 'backbone_rpn.pt' predictor = get_model() backbone_rpn_wrapper = NeuronFusedBackboneRPNHead(predictor.model) backbone_rpn_wrapper.eval() example = torch.rand([1, 3, 800, 800]) compile(backbone_rpn_wrapper, example, backbone_rpn_filename, strict=False) # Create and compile the RoI Head wrapper roi_head_filename = 'box_head_predictor.pt' predictor = get_model() box_head_predictor = NeuronBoxHeadBoxPredictor(predictor.model) box_head_predictor.eval() example = torch.rand([1000, 256, 7, 7]) compile(box_head_predictor, example, roi_head_filename) # Download a sample image from the COCO dataset and read it image_filename = get_image() image = cv2.imread(image_filename) inputs = preprocess(image, get_model()) # Benchmark the Neuron R-CNN model for various numbers of loaded models benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=1, n_threads=1) benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=1, n_threads=2) benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=2, n_threads=2) benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=2, n_threads=4) benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=4, n_threads=4) benchmark(backbone_rpn_filename, roi_head_filename, inputs, n_models=4, n_threads=8) ================================================ FILE: about-neuron/appnotes/torch-neuron/torch-neuron-dataparallel-app-note.rst ================================================ .. _torch-neuron-dataparallel-app-note: Data Parallel Inference on Torch Neuron ======================================= .. contents:: Table of Contents :local: :depth: 2 Introduction ------------ This guide introduces :func:`torch.neuron.DataParallel`, a Python API that implements data parallelism on :class:`~torch.jit.ScriptModule` models created by the :doc:`Trace API `. 
The following sections explain how data parallelism can improve the
performance of inference workloads on Inferentia, including how
:func:`torch.neuron.DataParallel` uses dynamic batching to run
inference on variable input sizes. It provides an overview of the
:func:`torch.neuron.DataParallel` module and a few :ref:`example data
parallel applications `.

Data parallel inference
-------------------------

Data Parallelism is a form of parallelization across multiple devices
or cores, referred to as nodes. Each node contains the same model and
parameters, but data is distributed across the different nodes. By
distributing the data across multiple nodes, data parallelism reduces
the total execution time of large batch size inputs compared to
sequential execution. Data parallelism works best for smaller models in
latency-sensitive applications that have large batch size requirements.

torch.neuron.DataParallel
-------------------------

To fully leverage the Inferentia hardware, we want to use all available
NeuronCores. An inf1.xlarge and inf1.2xlarge have four NeuronCores, an
inf1.6xlarge has 16 NeuronCores, and an inf1.24xlarge has 64
NeuronCores. For maximum performance on Inferentia hardware, we can use
:func:`torch.neuron.DataParallel` to utilize all available NeuronCores.

:func:`torch.neuron.DataParallel` implements data parallelism at the
module level by replicating the Neuron model on all available
NeuronCores and distributing data across the different cores for
parallelized inference. This function is analogous to
:class:`~torch.nn.DataParallel` in PyTorch.
:func:`torch.neuron.DataParallel` requires PyTorch >= 1.8.

The following sections provide an overview of some of the features of
:func:`torch.neuron.DataParallel` that enable maximum performance on
Inferentia.

NeuronCore selection
^^^^^^^^^^^^^^^^^^^^

By default, DataParallel will try to use all NeuronCores allocated to
the current process to fully saturate the Inferentia hardware for
maximum performance. It is more efficient to make the batch dimension
divisible by the number of NeuronCores. This will ensure that
NeuronCores are not left idle during parallel inference and the
Inferentia hardware is fully utilized.

In some applications, it is advantageous to use a subset of the
available NeuronCores for DataParallel inference. DataParallel has a
``device_ids`` argument that accepts a list of :obj:`int` or ``'nc:#'``
that specify the NeuronCores to use for parallelization. See
:ref:`Specifying NeuronCores ` for an example of how to use the
``device_ids`` argument.

Batch dim
^^^^^^^^^

DataParallel accepts a ``dim`` argument that denotes the batch
dimension used to split the input data for distributed inference. By
default, DataParallel splits the inputs on ``dim = 0`` if the ``dim``
argument is not specified. For applications with a non-zero batch dim,
the ``dim`` argument can be used to specify the inference-time input
batch dimension. :ref:`DataParallel with dim != 0 ` provides an example
of data parallel inference on inputs with batch dim = 2.

.. _dynamic_batching_description:

Dynamic batching
^^^^^^^^^^^^^^^^

Batch size has a direct impact on model performance. The Inferentia
chip is optimized to run with small batch sizes. This means that a
Neuron-compiled model can outperform a GPU model, even if running
single-digit batch sizes. As a general best practice, we recommend
optimizing your model's throughput by compiling the model with a small
batch size and gradually increasing it to find the peak throughput on
Inferentia.
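For example, the following minimal sketch (assuming a standard
torchvision ResNet-50; any traceable model works the same way) compiles
the model with batch size 1, wraps it in
:func:`torch.neuron.DataParallel`, and then runs inference on a larger
batch that is split across the available NeuronCores:

.. code-block:: python

    import torch
    import torch_neuron
    from torchvision import models

    # Load a pretrained model and compile it with a small batch size
    model = models.resnet50(pretrained=True).eval()
    example = torch.rand([1, 3, 224, 224])
    model_neuron = torch_neuron.trace(model, example)

    # Replicate the compiled model across all available NeuronCores
    model_parallel = torch.neuron.DataParallel(model_neuron)

    # Run a larger batch; the input is split on the batch dimension
    # (dim=0 by default) and distributed across the NeuronCores
    batch = torch.rand([8, 3, 224, 224])
    output = model_parallel(batch)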
Dynamic batching is a feature that allows you to use tensor batch sizes
that the Neuron model was not originally compiled against. This is
necessary because the underlying Inferentia hardware will always
execute inferences with the batch size used during compilation. Fixed
batch size execution allows tuning the input batch size for optimal
performance. For example, batch size 1 may be best suited for an
ultra-low latency on-demand inference application, while batch size > 1
can be used to maximize throughput for offline inferencing.

Dynamic batching is implemented by slicing large input tensors into
chunks that match the batch size used during the
:func:`torch_neuron.trace` compilation call. The
:func:`torch.neuron.DataParallel` class automatically enables dynamic
batching on eligible models. This allows us to run inference in
applications that have inputs with a variable batch size without
needing to recompile the model. See :ref:`Dynamic batching ` for an
example of how DataParallel can be used to run inference on inputs with
a dynamic batch size without needing to recompile the model.

Dynamic batching using small batch sizes can result in sub-optimal
throughput because it involves slicing tensors into chunks and
iteratively sending data to the hardware. Using a larger batch size at
compilation time can use the Inferentia hardware more efficiently in
order to maximize throughput. You can test the tradeoff between
individual request latency and total throughput by fine-tuning the
input batch size.

Dynamic batching in the DataParallel module can be disabled using the
``disable_dynamic_batching()`` function as follows:

.. code-block:: python

    >>> model_parallel = torch.neuron.DataParallel(model_neuron)
    >>> model_parallel.disable_dynamic_batching()

If dynamic batching is disabled, the compile-time batch size must be
equal to the inference-time batch size divided by the number of
NeuronCores. :ref:`DataParallel with dim != 0 ` and :ref:`Dynamic
batching disabled ` provide examples of running DataParallel inference
with dynamic batching disabled.

Performance optimizations
^^^^^^^^^^^^^^^^^^^^^^^^^

The DataParallel module has a ``num_workers`` attribute that can be
used to specify the number of worker threads used for multithreaded
inference. By default, ``num_workers = 2 * number of NeuronCores``.
This value can be fine-tuned to optimize DataParallel performance.

DataParallel has a ``split_size`` attribute that dictates the size of
the input chunks that are distributed to each NeuronCore. By default,
``split_size = max(1, input.shape[dim] // number of NeuronCores)``.
This value can be modified to optimally match the inference input chunk
size with the compile-time batch size.

.. _data_paraellel_examples:

Examples
--------

The following sections provide example usages of the
:func:`torch.neuron.DataParallel` module.

.. _dataparallel_example_default:

Default usage
^^^^^^^^^^^^^

.. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-default.rst

.. _dataparallel_example_specify_ncs:

Specifying NeuronCores
^^^^^^^^^^^^^^^^^^^^^^

.. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-specify-ncs.rst

.. _dataparallel_example_dim_neq_zero:

DataParallel with dim != 0
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-dim-neq-zero.rst

.. _dataparallel_example_dynamic_batching:

Dynamic batching
^^^^^^^^^^^^^^^^

.. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-dynamic-batching.rst
.. _dataparallel_example_disable_dynamic_batching:

Dynamic batching disabled
^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-disable-dynamic-batching.rst

Full tutorial with torch.neuron.DataParallel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For an end-to-end tutorial that uses DataParallel, see the
:ref:`PyTorch Resnet Tutorial `.

================================================
FILE: about-neuron/appnotes/torch-neuronx/index.rst
================================================

.. _torch-neuronx-appnotes:

PyTorch NeuronX Application Notes
==================================

.. toctree::
   :maxdepth: 1
   :hidden:

   introducing-pytorch-2-6
   introducing-pytorch-2-7
   introducing-pytorch-2-8
   introducing-pytorch-2-9
   introducing-pytorch-2-x
   migration-from-xla-downcast-bf16
   torch-neuronx-dataparallel-app-note
   torch-neuronx-graph-partitioner-app-note

This section contains application notes specific to PyTorch NeuronX
(``torch-neuronx``) for ``Trn1`` and ``Inf2`` instances. These guides
cover PyTorch version migrations, advanced features, optimization
techniques, and best practices for training and inference on AWS
Trainium and Inferentia2.

PyTorch Version Support
-----------------------

.. grid:: 1 1 2 2
   :gutter: 2

   .. grid-item-card::
      :link: introducing-pytorch-2-9
      :link-type: doc

      **PyTorch 2.9 Support**
      ^^^
      New features and migration guide for PyTorch 2.9 on Neuron

   .. grid-item-card::
      :link: introducing-pytorch-2-8
      :link-type: doc

      **PyTorch 2.8 Support**
      ^^^
      New features and migration guide for PyTorch 2.8 on Neuron

   .. grid-item-card::
      :link: introducing-pytorch-2-7
      :link-type: doc

      **PyTorch 2.7 Support**
      ^^^
      Features and improvements introduced with PyTorch 2.7 support

   .. grid-item-card::
      :link: introducing-pytorch-2-x
      :link-type: doc

      **PyTorch 2.x Overview**
      ^^^
      General guide to PyTorch 2.x series support and features

Advanced Features
-----------------

.. grid:: 1 1 2 2
   :gutter: 2

   .. grid-item-card::
      :link: torch-neuronx-graph-partitioner-app-note
      :link-type: doc

      **Graph Partitioner**
      ^^^
      Advanced graph partitioning strategies for distributed training and inference

   .. grid-item-card::
      :link: torch-neuronx-dataparallel-app-note
      :link-type: doc

      **Data Parallel Inference**
      ^^^
      Scale inference workloads using ``torch_neuronx.DataParallel`` for multi-core execution

   .. grid-item-card::
      :link: migration-from-xla-downcast-bf16
      :link-type: doc

      **XLA Migration Guide**
      ^^^
      Migrate from deprecated XLA environment variables to PyTorch mixed-precision options

================================================
FILE: about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-6.rst
================================================

.. _introduce-pytorch-2-6:

Introducing PyTorch 2.6 Support
===============================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the :ref:`Neuron 2.23 ` release, customers can now
upgrade to PyTorch NeuronX (``torch-neuronx``) with specific support
for PyTorch version 2.6. :ref:`setup-torch-neuronx` is updated to
include installation instructions for PyTorch NeuronX 2.6 for Amazon
Linux 2023 and Ubuntu 22.04. Note that PyTorch NeuronX 2.6 is supported
on Python 3.9, 3.10, and 3.11.

Review the :ref:`migration guide ` for possible changes to training
scripts. No code changes are required for inference scripts.

.. _how-pytorch-2.6-different:

How is PyTorch NeuronX 2.6 different compared to PyTorch NeuronX 2.5?
---------------------------------------------------------------------
PyTorch NeuronX 2.6 uses Torch-XLA 2.6, which has improved support for
Automatic Mixed Precision and buffer aliasing. Additionally:

* Reintroduced ``XLA_USE_32BIT_LONG`` to give customers the flexibility
  to use INT32 for their workloads. This flag was removed in v2.5.
* Added ``xm.xla_device_kind()`` to return the XLA device kind string
  (``'NC_v2'`` for Trainium1; ``'NC_v3'`` and ``'NC_v3d'`` for
  Trainium2). See :ref:`logical-neuroncore-config` for more info.

See the `Torch-XLA 2.6 release `__ for a full list. See
:ref:`migrate_to_pytorch_2.6` for changes needed to use PyTorch NeuronX
2.6.

.. note::

   GSPMD and Torch Dynamo (torch.compile) support in Neuron will be
   available in a future release.

.. _install_pytorch_neuron_2.6:

How can I install PyTorch NeuronX 2.6?
--------------------------------------

To install PyTorch NeuronX 2.6, follow the :ref:`setup-torch-neuronx`
guides for Amazon Linux 2023 and the Ubuntu 22.04 AMI. Refer to the
Neuron Multi-Framework DLAMI :ref:`setup guide ` for Ubuntu 22.04 with
a pre-installed virtual environment for PyTorch NeuronX 2.6 that you
can use to get started. PyTorch NeuronX 2.6 can be installed using the
following:

.. code::

   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.6.* torchvision

.. note::

   PyTorch NeuronX 2.6 is currently available for Python 3.9, 3.10,
   and 3.11.

.. _migrate_to_pytorch_2.6:

Migrate your application to PyTorch 2.6
---------------------------------------

First, install PyTorch NeuronX 2.6 as described in the
:ref:`installation guide ` above.

Migrating training scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

To migrate training scripts from PyTorch NeuronX 2.5 to PyTorch NeuronX
2.6, implement the following changes (a consolidated sketch of the API
replacements appears at the end of this section):

.. note::

   ``xm`` below refers to ``torch_xla.core.xla_model``, ``xr`` refers
   to ``torch_xla.runtime``, and ``xmp`` refers to
   ``torch_xla.distributed.xla_multiprocessing``.

* The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16``
  are deprecated (warnings are shown when used) and will be removed in
  an upcoming release. Switch to automatic mixed-precision or use the
  ``model.to(torch.bfloat16)`` command to convert the model to BF16
  format. (see :ref:`migration_from_xla_downcast_bf16`)
* The functions ``xm.xrt_world_size()``, ``xm.get_ordinal()``, and
  ``xm.get_local_ordinal()`` are deprecated (warnings are shown when
  used). Switch to ``xr.world_size()``, ``xr.global_ordinal()``, and
  ``xr.local_ordinal()`` respectively as replacements.
* The default behavior of the ``torch.load`` parameter ``weights_only``
  has changed from ``False`` to ``True``. Setting ``weights_only`` to
  ``True`` may cause issues with pickling custom objects.
* If using ``xmp.spawn``, the ``nprocs`` argument is limited to 1 or
  ``None`` since v2.1. Previously, passing a value > 1 would result in
  a warning. In torch-xla 2.6, passing a value > 1 will result in an
  error with an actionable message to use ``NEURON_NUM_DEVICES`` to set
  the number of NeuronCores to use.

See the :ref:`v2.5 migration guide ` for additional changes needed if
you are migrating from PyTorch NeuronX 2.1.

Migrating inference scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the inference scripts.
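The following minimal sketch consolidates the deprecated-API
replacements described above for training scripts (assuming the code
runs inside a torch-xla process, for example one launched with
``torchrun``):

.. code:: python

    import torch_xla.runtime as xr

    # Deprecated in torch-xla 2.6 (warnings are shown when used):
    #   world_size = xm.xrt_world_size()
    #   rank       = xm.get_ordinal()
    #   local_rank = xm.get_local_ordinal()

    # Replacements:
    world_size = xr.world_size()
    rank = xr.global_ordinal()
    local_rank = xr.local_ordinal()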
Troubleshooting and Known Issues
--------------------------------

Tensor split on second dimension of 2D array not working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, when using the tensor split operation on a 2D array in the
second dimension, the resulting tensors do not contain the expected
data (https://github.com/pytorch/xla/issues/8640). The workaround is to
set ``XLA_DISABLE_FUNCTIONALIZATION=0``. Another workaround is to use
``torch.tensor_split``.

Lower BERT pretraining performance with torch-neuronx 2.6 compared to torch-neuronx 2.5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is ~10% lower with
torch-neuronx 2.6 compared to torch-neuronx 2.5. This is due to a known
regression in the torch-xla library
(https://github.com/pytorch/xla/issues/9037) and may affect other
models with high graph tracing overhead. To work around this issue,
build the ``r2.6_aws_neuron`` branch of torch-xla as follows (see
:ref:`pytorch-neuronx-install-cxx11` for the C++11 ABI version):

.. code:: bash

   # Setup build env (make sure you are in a python virtual env). Replace "apt" with "yum" on AL2023.
   sudo apt install cmake
   pip install yapf==0.30.0
   wget https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64
   sudo cp bazelisk-linux-amd64 /usr/local/bin/bazel

   # Clone repos
   git clone --recursive https://github.com/pytorch/pytorch --branch v2.6.0
   cd pytorch/
   git clone --recursive https://github.com/pytorch/xla.git --branch r2.6_aws_neuron

   # Build torch; the pip wheel will be present in ./dist
   _GLIBCXX_USE_CXX11_ABI=0 python setup.py bdist_wheel

   # Build torch-xla; the pip wheel will be present in ./dist and can be
   # installed instead of the torch-xla released in pypi.org
   cd xla/
   CXX_ABI=0 python setup.py bdist_wheel

Lower BERT pretraining performance when switching to ``model.to(torch.bfloat16)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is approximately 11% lower when
switching to ``model.to(torch.bfloat16)`` as part of the migration away
from the deprecated environment variable ``XLA_DOWNCAST_BF16``, due to
https://github.com/pytorch/xla/issues/8545. As a workaround to recover
the performance, you can set ``XLA_DOWNCAST_BF16=1``, which will still
work in torch-neuronx 2.5 and 2.6, although there will be
end-of-support warnings (as noted below).

Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.6 release, please downcast your model directly"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are
deprecated (warnings are shown when used). Switch to automatic
mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast
the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

WARNING:root:torch_xla.core.xla_model.xrt_world_size() will be removed in release 2.7. is deprecated. Use torch_xla.runtime.world_size instead.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that ``torch_xla.core.xla_model.xrt_world_size()``
will be removed in a future release. Switch to using
``torch_xla.runtime.world_size`` instead.

WARNING:torch_xla.core.xla_model.get_ordinal() will be removed in release 2.7. is deprecated. Use torch_xla.runtime.global_ordinal instead.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that ``torch_xla.core.xla_model.get_ordinal()`` will
be removed in a future release. Switch to using
``torch_xla.runtime.global_ordinal`` instead.

WARNING:torch_xla.core.xla_model.get_local_ordinal() will be removed in release 2.7. is deprecated. Use torch_xla.runtime.local_ordinal instead.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that ``torch_xla.core.xla_model.get_local_ordinal()``
will be removed in a future release. Switch to using
``torch_xla.runtime.local_ordinal`` instead.

Socket Error: Socket failed to bind
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.6, there must be a socket available for both torchrun and
the ``init_process_group`` to bind. By default, both will be set to use
unused sockets. If you plan to use a ``MASTER_PORT`` environment
variable, this error may occur if the port you set it to is already in
use.

.. code::

   [W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:2.600 (errno: 98 - Address already in use).
   [W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
   [E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
   RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

To resolve the issue, ensure you are setting ``MASTER_PORT`` to a port
value that is not used anywhere else in your scripts. Otherwise, you
can leave ``MASTER_PORT`` unset and torchrun will set the default port
for you.

``AttributeError: module 'torch' has no attribute 'xla'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.6, training scripts might fail during activation
checkpointing with the error shown below.

.. code::

   AttributeError: module 'torch' has no attribute 'xla'

The solution is to use ``torch_xla.utils.checkpoint.checkpoint``
instead of ``torch.utils.checkpoint.checkpoint`` as the checkpoint
function while wrapping pytorch modules for activation checkpointing.
Refer to the pytorch/xla discussion regarding this `issue `_. Also set
``use_reentrant=True`` while calling the torch_xla checkpoint function.
Failure to do so will lead to the ``XLA currently does not support
use_reentrant==False`` error. For more details on checkpointing, refer
to the `documentation `_.

Error ``Attempted to access the data pointer on an invalid python storage`` when using HF Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While using the HuggingFace Transformers Trainer API to train (i.e. the
:ref:`HuggingFace Trainer API fine-tuning tutorial`), you may see the
error "Attempted to access the data pointer on an invalid python
storage". This is a known `issue `_ and has been fixed in version
``4.37.3`` of HuggingFace Transformers.

``ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory`` on Amazon Linux 2023
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

torch-xla version 2.6+ now requires the ``libcrypt.so.1`` shared
library.
Currently, Amazon Linux 2023 includes the ``libcrypt.so.2`` shared
library by default, so you may see ``ImportError: libcrypt.so.1: cannot
open shared object file: No such file or directory`` when using
torch-neuronx 2.1+ on Amazon Linux 2023. To install ``libcrypt.so.1``
on Amazon Linux 2023, run the following installation command (see also
https://github.com/amazonlinux/amazon-linux-2023/issues/182 for more
context):

.. code::

   sudo dnf install libxcrypt-compat

``FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.6, users might face the error shown below due to
incompatible ``libneuronxla`` and ``torch-neuronx`` versions being
installed.

.. code::

   FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'

Check that the version of ``libneuronxla`` that supports PyTorch
NeuronX 2.6 is ``2.2.*``. If not, uninstall ``libneuronxla`` using
``pip uninstall libneuronxla`` and then reinstall the packages by
following the :ref:`installation guide `.

``Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` error during Neuron Parallel Compile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running Neuron Parallel Compile with the HF Trainer API, you may
see the errors ``Status: INVALID_ARGUMENT: Input dimension should be
either 1 or equal to the output dimension it is broadcasting into`` or
``IndexError: index out of range`` in Accelerator's
``pad_across_processes`` function. This is due to data-dependent
operations in evaluation metrics computation. Data-dependent operations
would result in undefined behavior with Neuron Parallel Compile trial
execution (which executes empty graphs with zero outputs). To work
around this error, disable ``compute_metrics`` when
``NEURON_EXTRACT_GRAPHS_ONLY`` is set to 1:

.. code:: python

   compute_metrics=None if os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY") else compute_metrics

Compiler assertion error when running Stable Diffusion training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With PyTorch 2.6 (torch-neuronx), you may encounter the following
compiler assertion error with Stable Diffusion training when gradient
accumulation is enabled. This will be fixed in an upcoming release. For
now, if you want to run Stable Diffusion training, disable gradient
accumulation in torch-neuronx 2.6 by keeping the `default gradient
accumulation steps of 1 `__.

.. code:: bash

   ERROR 222163 [NeuronAssert]: Assertion failure in usr/lib/python3.9/concurrent/futures/process.py at line 239 with exception: too many partition dims! {{0,+,960}[10],+,10560}[10]

Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my models with PyTorch 2.6?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes.

Do I need to update my scripts for PyTorch 2.6?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the :ref:`migration guide `.

What environment variables will be changed with PyTorch NeuronX 2.6?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16``
are deprecated (warnings are shown when used). Switch to automatic
mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast
the model to BF16.
(see :ref:`migration_from_xla_downcast_bf16`)

What features will be missing with PyTorch NeuronX 2.6?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch NeuronX 2.6 has all of the supported features of PyTorch
NeuronX 2.5, with the known issues listed above and the unsupported
features listed in :ref:`pytorch_rn`.

Can I use Neuron Distributed and Transformers Neuron libraries with PyTorch NeuronX 2.6?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, the NeuronX Distributed, Transformers NeuronX, and AWS Neuron
Reference for NeMo Megatron libraries will work with PyTorch NeuronX
2.6.

Can I still use PyTorch 2.5 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.5 is supported for releases 2.21/2.22/2.23 and will reach
end-of-life in a future release. Additionally, the CVE
`CVE-2025-32434 `_ affects PyTorch version 2.5. We recommend upgrading
to the new version of Torch-NeuronX by following
:ref:`setup-torch-neuronx`.

Can I still use PyTorch 2.1 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.1 is supported for release 2.21 and has reached end-of-life
in release 2.22. Additionally, the CVEs `CVE-2024-31583 `_ and
`CVE-2024-31580 `_ affect PyTorch versions 2.1 and earlier. We
recommend upgrading to the new version of Torch-NeuronX by following
:ref:`setup-torch-neuronx`.

================================================
FILE: about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-7.rst
================================================

.. _introduce-pytorch-2-7:

Introducing PyTorch 2.7 Support
===============================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the :ref:`Neuron 2.24 ` release, customers can now
upgrade to PyTorch NeuronX (``torch-neuronx``) with specific support
for PyTorch version 2.7. :ref:`setup-torch-neuronx` is updated to
include installation instructions for PyTorch NeuronX 2.7 for Amazon
Linux 2023 and Ubuntu 22.04. Note that PyTorch NeuronX 2.7 is supported
on Python 3.9, 3.10, and 3.11.

Review the :ref:`migration guide ` for possible changes to training
scripts. No code changes are required for inference scripts.

.. _how-pytorch-2.7-different:

How is PyTorch NeuronX 2.7 different compared to PyTorch NeuronX 2.5?
---------------------------------------------------------------------

PyTorch NeuronX 2.7 uses Torch-XLA v2.7 and PyTorch v2.7, which have
the C++11 ABI enabled by default. Additionally, Torch-XLA v2.7 includes
a fix for the training performance issue
https://github.com/pytorch/xla/issues/9037.

See the `Torch-XLA 2.7 release `__ for a full list. See
:ref:`migrate_to_pytorch_2.7` for changes needed to use PyTorch NeuronX
2.7.

.. note::

   GSPMD and Torch Dynamo (torch.compile) support in Neuron will be
   available in a future release.

.. _install_pytorch_neuron_2.7:

How can I install PyTorch NeuronX 2.7?
--------------------------------------

To install PyTorch NeuronX 2.7, follow the :ref:`setup-torch-neuronx`
guides for Amazon Linux 2023 and the Ubuntu 22.04 AMI. Refer to the
Neuron Multi-Framework DLAMI :ref:`setup guide ` for Ubuntu 22.04 with
a pre-installed virtual environment for PyTorch NeuronX 2.7 that you
can use to get started. PyTorch NeuronX 2.7 can be installed using the
following:

.. code::

   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.7.* torchvision

.. note::

   PyTorch NeuronX 2.7 is currently available for Python 3.9, 3.10,
   and 3.11.
.. _migrate_to_pytorch_2.7:

Migrate your application to PyTorch 2.7
---------------------------------------

First, install PyTorch NeuronX 2.7 as described in the
:ref:`installation guide ` above.

Migrating training scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

To migrate training scripts from PyTorch NeuronX 2.5/2.6 to PyTorch
NeuronX 2.7, implement the following changes:

.. note::

   ``xm`` below refers to ``torch_xla.core.xla_model``, ``xr`` refers
   to ``torch_xla.runtime``, and ``xmp`` refers to
   ``torch_xla.distributed.xla_multiprocessing``.

* The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16``
  are deprecated (warnings are shown when used) and will be removed in
  an upcoming release. Switch to automatic mixed-precision or use the
  ``model.to(torch.bfloat16)`` command to convert the model to BF16
  format. (see :ref:`migration_from_xla_downcast_bf16`)
* The functions ``xm.xrt_world_size()``, ``xm.get_ordinal()``, and
  ``xm.get_local_ordinal()`` have been removed and now raise errors
  when used. Switch to ``xr.world_size()``, ``xr.global_ordinal()``,
  and ``xr.local_ordinal()`` respectively as replacements.
* The default behavior of the ``torch.load`` parameter ``weights_only``
  has changed from ``False`` to ``True``. Setting ``weights_only`` to
  ``True`` may cause issues with pickling custom objects.
* If using ``xmp.spawn``, the ``nprocs`` argument is limited to 1 or
  ``None`` since v2.1. Previously, passing a value > 1 would result in
  a warning. In torch-xla 2.6+, passing a value > 1 will result in an
  error with an actionable message to use ``NEURON_NUM_DEVICES`` to set
  the number of NeuronCores to use.

See the :ref:`v2.6 migration guide ` for additional changes needed if
you are migrating from PyTorch NeuronX 2.5. See the :ref:`v2.5
migration guide ` for additional changes needed if you are migrating
from PyTorch NeuronX 2.1.

Migrating inference scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the inference scripts.

Troubleshooting and Known Issues
--------------------------------

Using the latest torch-xla v2.7 may result in an increase in host memory usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the latest torch-xla v2.7 may result in an increase in host
memory usage compared to torch-xla v2.6. In one example, LLama2
pretraining with ZeRO1 and sequence length 16k could see an increase of
1.6% in host memory usage.

TypeError: AdamW.__init__() got an unexpected keyword argument 'decoupled_weight_decay'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AdamW now has an additional argument ``decoupled_weight_decay``, which
defaults to ``False``. If you get "TypeError: AdamW.__init__() got an
unexpected keyword argument 'decoupled_weight_decay'" with NeuronX
Distributed, update NeuronX Distributed to the latest version.

Tensor split on second dimension of 2D array not working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, when using the tensor split operation on a 2D array in the
second dimension, the resulting tensors do not contain the expected
data (https://github.com/pytorch/xla/issues/8640). The workaround is to
set ``XLA_DISABLE_FUNCTIONALIZATION=0``. Another workaround is to use
``torch.tensor_split``.
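As a minimal sketch of the ``torch.tensor_split`` workaround (the
tensor shape and split count here are arbitrary illustrative values):

.. code:: python

    import torch
    import torch_xla.core.xla_model as xm

    x = torch.arange(12).reshape(3, 4).to(xm.xla_device())

    # torch.split along the second dimension is affected by the issue
    # (https://github.com/pytorch/xla/issues/8640):
    #   parts = torch.split(x, 2, dim=1)

    # torch.tensor_split is an unaffected alternative that yields two
    # chunks along dim=1:
    parts = torch.tensor_split(x, 2, dim=1)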
Lower BERT pretraining performance when switching to ``model.to(torch.bfloat16)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is approximately 11% lower when switching to ``model.to(torch.bfloat16)`` as part of migration away from the deprecated environment variable ``XLA_DOWNCAST_BF16`` due to https://github.com/pytorch/xla/issues/8545. As a workaround to recover the performance, you can set ``XLA_DOWNCAST_BF16=1``, which will still work in torch-neuronx 2.5 and 2.7, although there will be end-of-support warnings (as noted below).

Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.6 release, please downcast your model directly"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

``AttributeError: module 'torch' has no attribute 'xla'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training scripts might fail during activation checkpointing with the error shown below.

.. code::

   AttributeError: module 'torch' has no attribute 'xla'

The solution is to use ``torch_xla.utils.checkpoint.checkpoint`` instead of ``torch.utils.checkpoint.checkpoint`` as the checkpoint function while wrapping pytorch modules for activation checkpointing. Refer to the pytorch/xla discussion regarding this `issue `_. Also set ``use_reentrant=True`` while calling the torch_xla checkpoint function. Failure to do so will lead to the ``XLA currently does not support use_reentrant==False`` error. For more details on checkpointing, refer to the `documentation `_.

Error ``Attempted to access the data pointer on an invalid python storage`` when using HF Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While using the HuggingFace Transformers Trainer API to train (e.g., the :ref:`HuggingFace Trainer API fine-tuning tutorial`), you may see the error "Attempted to access the data pointer on an invalid python storage". This is a known `issue `_ and has been fixed in version ``4.37.3`` of HuggingFace Transformers.

``ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory`` on Amazon Linux 2023
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

torch-xla version 2.5+ now requires the ``libcrypt.so.1`` shared library. Currently, Amazon Linux 2023 includes the ``libcrypt.so.2`` shared library by default, so you may see ``ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory`` when using torch-neuronx 2.1+ on Amazon Linux 2023. To install ``libcrypt.so.1`` on Amazon Linux 2023, run the following installation command (see also https://github.com/amazonlinux/amazon-linux-2023/issues/182 for more context):

.. code::

   sudo dnf install libxcrypt-compat

``FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.7, users might face the error shown below due to incompatible ``libneuronxla`` and ``torch-neuronx`` versions being installed.

.. code::

   FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'

Check that the version of ``libneuronxla`` that supports PyTorch NeuronX 2.7 is ``2.2.*``.
If not, then uninstall ``libneuronxla`` using ``pip uninstall libneuronxla`` and then reinstall the packages following the :ref:`installation guide `.

``Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` error during Neuron Parallel Compile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running Neuron Parallel Compile with the HF Trainer API, you may see the errors ``Status: INVALID_ARGUMENT: Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` in Accelerator's ``pad_across_processes`` function. This is due to data-dependent operations in evaluation metrics computation. Data-dependent operations result in undefined behavior with Neuron Parallel Compile trial execution (which executes empty graphs with zero outputs). To work around this error, disable ``compute_metrics`` when ``NEURON_EXTRACT_GRAPHS_ONLY`` is set to 1:

.. code:: python

   compute_metrics=None if os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY") else compute_metrics

Compiler assertion error when running Stable Diffusion training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With PyTorch 2.7 (torch-neuronx), you may encounter the following compiler assertion error with Stable Diffusion training when gradient accumulation is enabled. This will be fixed in an upcoming release. For now, if you want to run Stable Diffusion training, disable gradient accumulation in torch-neuronx 2.7 by keeping the `default gradient accumulation steps of 1 `__.

.. code:: bash

   ERROR 222163 [NeuronAssert]: Assertion failure in usr/lib/python3.9/concurrent/futures/process.py at line 239 with exception: too many partition dims! {{0,+,960}[10],+,10560}[10]

Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my models with PyTorch 2.7?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes.

Do I need to update my scripts for PyTorch 2.7?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the :ref:`migration guide `.

What environment variables will be changed with PyTorch NeuronX 2.7?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

What features will be missing with PyTorch NeuronX 2.7?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch NeuronX 2.7 has all of the supported features in PyTorch NeuronX 2.6, with known issues listed above, and unsupported features as listed in :ref:`pytorch_rn`.

Can I use Neuron Distributed and Transformers Neuron libraries with PyTorch NeuronX 2.7?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, NeuronX Distributed and Transformers NeuronX are supported by PyTorch NeuronX 2.7. AWS Neuron Reference for NeMo Megatron has reached end-of-support in release 2.23.

Can I still use PyTorch 2.6 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.6 is supported since release 2.23.

Can I still use PyTorch 2.5 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.5 is supported for releases 2.21 to 2.24 and will reach end-of-life in a future release. Additionally, the CVE `CVE-2025-32434 `_ affects PyTorch version 2.5. We recommend upgrading to the new version of Torch-NeuronX by following :ref:`setup-torch-neuronx`.

Can I still use PyTorch 2.1 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.1 is supported for release 2.21 and has reached end-of-life in release 2.22. Additionally, the CVEs `CVE-2024-31583 `_ and `CVE-2024-31580 `_ affect PyTorch versions 2.1 and earlier. We recommend upgrading to the new version of Torch-NeuronX by following :ref:`setup-torch-neuronx`.

================================================
FILE: about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-8.rst
================================================

.. _introduce-pytorch-2-8:

Introducing PyTorch 2.8 Support
===============================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the :ref:`Neuron 2.26 ` release, customers can now upgrade to PyTorch NeuronX (``torch-neuronx``) with specific support for PyTorch version 2.8. :ref:`setup-torch-neuronx` is updated to include installation instructions for PyTorch NeuronX 2.8 for Ubuntu 22.04. Note that PyTorch NeuronX 2.8 is supported on Python 3.10 and 3.11, with 3.12+ support coming in a future release.

Review the :ref:`migration guide ` for possible changes to training scripts. No code changes are required for inference scripts.

.. _how-pytorch-2.8-different:

How is PyTorch NeuronX 2.8 different compared to PyTorch NeuronX 2.7?
---------------------------------------------------------------------

See the `Torch-XLA 2.8 release `__ for a full list of changes. See :ref:`migrate_to_pytorch_2.8` for changes needed to use PyTorch NeuronX 2.8.

.. note::

   GSPMD and Torch Dynamo (torch.compile) support in Neuron will be available in a future release.

.. _install_pytorch_neuron_2.8:

How can I install PyTorch NeuronX 2.8?
--------------------------------------------

To install PyTorch NeuronX 2.8, follow the :ref:`setup-torch-neuronx` guide for the Ubuntu 22.04 AMI. Refer to the Neuron Multi-Framework DLAMI :ref:`setup guide ` for Ubuntu 22.04 with a pre-installed virtual environment for PyTorch NeuronX 2.8 that you can use to get started. PyTorch NeuronX 2.8 can be installed using the following:

.. code::

   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.8.* torchvision

.. note::

   PyTorch NeuronX 2.8 is currently available for Python 3.10 and 3.11, with 3.12+ support coming in a future release.

.. note::

   To use PyTorch NeuronX 2.8 on Amazon Linux 2023, you will need to install Python 3.10 or 3.11.

.. _migrate_to_pytorch_2.8:

Migrate your application to PyTorch 2.8
---------------------------------------

First, install PyTorch NeuronX 2.8 as described above in the :ref:`installation guide `.

Migrating training scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the training scripts to move from PyTorch NeuronX 2.7 to PyTorch NeuronX 2.8. See the :ref:`v2.7 migration guide ` for additional changes needed if you are migrating from PyTorch NeuronX 2.6. See the :ref:`v2.6 migration guide ` for additional changes needed if you are migrating from PyTorch NeuronX 2.5.

Migrating inference scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the inference scripts.
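As a quick post-upgrade sanity check (a minimal sketch, not part of the official setup steps), you can confirm the installed versions and run a small computation on a Neuron device:

.. code:: python

   import torch
   import torch_xla

   print(torch.__version__)      # expect a 2.8.x version
   print(torch_xla.__version__)  # expect a 2.8.x version

   device = torch_xla.device()   # current replacement for xm.xla_device()
   y = torch.ones(2, 2, device=device) + 1
   torch_xla.sync()              # current replacement for xm.mark_step()
   print(y)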
Troubleshooting and Known Issues
--------------------------------

[v2.8] Lower BERT/LLaMA performance with torch-xla 2.8.0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the publicly released version of torch-xla 2.8.0 from public PyPI repositories results in lower performance for models like BERT and LLaMA (https://github.com/pytorch/xla/issues/9605). To fix this, switch to the updated torch-xla version 2.8.1 from public PyPI repositories.

Using the latest torch-xla 2.7/2.8 may result in an increase in host memory usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using torch-xla 2.7/2.8 may result in an increase in host memory usage compared to torch-xla 2.6. In one example, LLama2 pretraining with ZeRO1 and sequence length 16k could see an increase of 1.6% in host memory usage.

TypeError: AdamW.__init__() got an unexpected keyword argument 'decoupled_weight_decay'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AdamW now has an additional argument ``decoupled_weight_decay`` which defaults to False. If you get ``TypeError: AdamW.__init__() got an unexpected keyword argument 'decoupled_weight_decay'`` with NeuronX Distributed, update to the latest version.

Tensor split on second dimension of 2D array not working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, when using the tensor split operation on a 2D array in the second dimension, the resulting tensors do not contain the expected data (https://github.com/pytorch/xla/issues/8640). The workaround is to set ``XLA_DISABLE_FUNCTIONALIZATION=0``. Another workaround is to use ``torch.tensor_split``.

Lower BERT pretraining performance when switching to ``model.to(torch.bfloat16)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is approximately 11% lower when switching to ``model.to(torch.bfloat16)`` as part of migration away from the deprecated environment variable ``XLA_DOWNCAST_BF16`` due to https://github.com/pytorch/xla/issues/8545. As a workaround to recover the performance, you can set ``XLA_DOWNCAST_BF16=1``, which will still work in torch-neuronx 2.5 to 2.8, although there will be end-of-support warnings (as noted below).

DeprecationWarning: Use torch_xla.device instead
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is an end-of-support warning when using ``torch_xla.core.xla_model.xla_device()``. Switch to ``torch_xla.device()`` instead.

DeprecationWarning: Use torch_xla.sync instead
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is an end-of-support warning when using ``torch_xla.core.xla_model.mark_step()``. Switch to ``torch_xla.sync()`` instead.

Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.6 release, please downcast your model directly"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

``AttributeError: module 'torch' has no attribute 'xla'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training scripts might fail during activation checkpointing with the error shown below.

.. code::

   AttributeError: module 'torch' has no attribute 'xla'

The solution is to use ``torch_xla.utils.checkpoint.checkpoint`` instead of ``torch.utils.checkpoint.checkpoint`` as the checkpoint function while wrapping pytorch modules for activation checkpointing. Refer to the pytorch/xla discussion regarding this `issue `_. Also set ``use_reentrant=True`` while calling the torch_xla checkpoint function. Failure to do so will lead to the ``XLA currently does not support use_reentrant==False`` error. For more details on checkpointing, refer to the `documentation `_.
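A minimal sketch of the recommended checkpoint wrapping (the layer and input shapes are illustrative only):

.. code:: python

   import torch
   import torch_xla
   from torch_xla.utils.checkpoint import checkpoint  # not torch.utils.checkpoint

   device = torch_xla.device()
   layer = torch.nn.Linear(16, 16).to(device)
   x = torch.randn(4, 16, device=device, requires_grad=True)

   # use_reentrant=True is required; use_reentrant=False raises
   # "XLA currently does not support use_reentrant==False"
   out = checkpoint(layer, x, use_reentrant=True)
   out.sum().backward()
   torch_xla.sync()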
Error ``Attempted to access the data pointer on an invalid python storage`` when using HF Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While using the HuggingFace Transformers Trainer API to train (e.g., the :ref:`HuggingFace Trainer API fine-tuning tutorial`), you may see the error "Attempted to access the data pointer on an invalid python storage". This is a known `issue `_ and has been fixed in version ``4.37.3`` of HuggingFace Transformers.

``Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` error during Neuron Parallel Compile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running Neuron Parallel Compile with the HF Trainer API, you may see the errors ``Status: INVALID_ARGUMENT: Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` in Accelerator's ``pad_across_processes`` function. This is due to data-dependent operations in evaluation metrics computation. Data-dependent operations result in undefined behavior with Neuron Parallel Compile trial execution (which executes empty graphs with zero outputs). To work around this error, disable ``compute_metrics`` when ``NEURON_EXTRACT_GRAPHS_ONLY`` is set to 1:

.. code:: python

   compute_metrics=None if os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY") else compute_metrics

Compiler assertion error when running Stable Diffusion training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With PyTorch 2.8 (torch-neuronx), you may encounter the following compiler assertion error with Stable Diffusion training when gradient accumulation is enabled. This will be fixed in an upcoming release. For now, if you want to run Stable Diffusion training, disable gradient accumulation in torch-neuronx 2.8 by keeping the `default gradient accumulation steps of 1 `__.

.. code:: bash

   ERROR 222163 [NeuronAssert]: Assertion failure in usr/lib/python3.9/concurrent/futures/process.py at line 239 with exception: too many partition dims! {{0,+,960}[10],+,10560}[10]

Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my models with PyTorch 2.8?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes.

Do I need to update my scripts for PyTorch 2.8?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the :ref:`migration guide `.

What environment variables will be changed with PyTorch NeuronX 2.8?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

What features will be missing with PyTorch NeuronX 2.8?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch NeuronX 2.8 has all of the supported features in PyTorch NeuronX 2.7, with known issues listed above, and unsupported features as listed in :ref:`pytorch_rn`.

Can I use Neuron Distributed libraries with PyTorch NeuronX 2.8?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, NeuronX Distributed libraries are supported by PyTorch NeuronX 2.8. Transformers NeuronX has reached end-of-support in release 2.26.
AWS Neuron Reference for NeMo Megatron has reached end-of-support in release 2.23.

Can I still use PyTorch 2.7 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.7 is supported since release 2.24.

Can I still use PyTorch 2.6 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.6 is supported since release 2.23.

Can I still use PyTorch 2.5 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.5 reached end-of-support in release 2.25.

Can I still use Amazon Linux 2023?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes. You will need to install Python 3.10 or 3.11 to use PyTorch NeuronX 2.8.

================================================
FILE: about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-9.rst
================================================

.. _introduce-pytorch-2-9:

Introducing PyTorch 2.9 Support
===============================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the :ref:`Neuron 2.27 ` release, customers can now upgrade to PyTorch NeuronX (``torch-neuronx``) with specific support for PyTorch version 2.9. PyTorch NeuronX 2.9 adds support for AWS Trainium 3 (Trn3) instances, in addition to existing support for Trainium (Trn2/Trn1/Trn1n) and Inferentia (Inf2) instances. :ref:`setup-torch-neuronx` is updated to include installation instructions for PyTorch NeuronX 2.9 for Ubuntu 24.04. Note that PyTorch NeuronX 2.9 is supported on Python 3.10, 3.11, and 3.12.

Review the :ref:`migration guide ` for possible changes to training scripts. No code changes are required for inference scripts.

.. _how-pytorch-2.9-different:

How is PyTorch NeuronX 2.9 different compared to PyTorch NeuronX 2.8?
---------------------------------------------------------------------

See the `Torch-XLA 2.9 release `__ for a full list of changes. See :ref:`migrate_to_pytorch_2.9` for changes needed to use PyTorch NeuronX 2.9.

.. note::

   Torch Dynamo (torch.compile) support in Neuron will be available in a future release.

.. _install_pytorch_neuron_2.9:

How can I install PyTorch NeuronX 2.9?
--------------------------------------------

To install PyTorch NeuronX 2.9, follow the :ref:`setup-torch-neuronx` guide for the Ubuntu 24.04 AMI. Refer to the Neuron Multi-Framework DLAMI :ref:`setup guide ` for Ubuntu 24.04 with a pre-installed virtual environment for PyTorch NeuronX 2.9 that you can use to get started. PyTorch NeuronX 2.9 can be installed using the following:

.. code::

   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.9.* torchvision

.. note::

   PyTorch NeuronX 2.9 is currently available for Python 3.10, 3.11, and 3.12.

.. note::

   To use PyTorch NeuronX 2.9 on Amazon Linux 2023, you will need to install Python 3.10, 3.11, or 3.12. See the `Amazon Linux 2023 Python documentation `_ for installation instructions.

.. _migrate_to_pytorch_2.9:

Migrate your application to PyTorch 2.9
---------------------------------------

First, install PyTorch NeuronX 2.9 as described above in the :ref:`installation guide `.

Migrating training scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the training scripts to move from PyTorch NeuronX 2.8 to PyTorch NeuronX 2.9. See the :ref:`v2.8 migration guide ` for additional changes needed if you are migrating from PyTorch NeuronX 2.7. See the :ref:`v2.7 migration guide ` for additional changes needed if you are migrating from PyTorch NeuronX 2.6.

Migrating inference scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the inference scripts.
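For reference, a minimal training-step sketch using the current torch_xla APIs (the model, optimizer, and data below are illustrative only); a script of this shape runs unchanged on PyTorch NeuronX 2.8 and 2.9:

.. code:: python

   import torch
   import torch_xla

   device = torch_xla.device()
   model = torch.nn.Linear(8, 2).to(device)
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

   for step in range(3):
       x = torch.randn(16, 8, device=device)
       y = torch.randint(0, 2, (16,), device=device)
       loss = torch.nn.functional.cross_entropy(model(x), y)
       loss.backward()
       optimizer.step()
       optimizer.zero_grad()
       torch_xla.sync()  # cut and execute the lazily-built graph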
Troubleshooting and Known Issues
--------------------------------

GLIBC compatibility issue on Amazon Linux 2023
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running PyTorch NeuronX 2.9 on Amazon Linux 2023, you may encounter the following error:

.. code::

   ImportError: /usr/lib64/libm.so.6: version `GLIBC_2.35' not found (required by /opt/conda/lib/python3.12/site-packages/_XLAC.cpython-312-x86_64-linux-gnu.so)

This occurs because the PyTorch NeuronX 2.9 binaries require GLIBC 2.35, but Amazon Linux 2023 ships with an older version of GLIBC. Use the Ubuntu 24.04 AMI instead, which has the required GLIBC version. Follow the :ref:`setup-torch-neuronx` installation guide for Ubuntu 24.04.

Using the latest torch-xla 2.7/2.8/2.9 may result in an increase in host memory usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the latest torch-xla v2.7/2.8/2.9 may result in an increase in host memory usage compared to torch-xla v2.6. In one example, LLama2 pretraining with ZeRO1 and sequence length 16k could see an increase of 1.6% in host memory usage.

TypeError: AdamW.__init__() got an unexpected keyword argument 'decoupled_weight_decay'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AdamW now has an additional argument ``decoupled_weight_decay`` which defaults to False. If you get ``TypeError: AdamW.__init__() got an unexpected keyword argument 'decoupled_weight_decay'`` with NeuronX Distributed, update to the latest version.

Tensor split on second dimension of 2D array not working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, when using the tensor split operation on a 2D array in the second dimension, the resulting tensors do not contain the expected data (https://github.com/pytorch/xla/issues/8640). The workaround is to set ``XLA_DISABLE_FUNCTIONALIZATION=0``. Another workaround is to use ``torch.tensor_split``.

Lower BERT pretraining performance when switching to ``model.to(torch.bfloat16)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is approximately 11% lower when switching to ``model.to(torch.bfloat16)`` as part of migration away from the deprecated environment variable ``XLA_DOWNCAST_BF16`` due to https://github.com/pytorch/xla/issues/8545. As a workaround to recover the performance, you can set ``XLA_DOWNCAST_BF16=1``, which will still work in torch-neuronx 2.5 through 2.9, although there will be end-of-support warnings (as noted below).

DeprecationWarning: Use torch_xla.device instead
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is an end-of-support warning when using ``torch_xla.core.xla_model.xla_device()``. Switch to ``torch_xla.device()`` instead.

DeprecationWarning: Use torch_xla.sync instead
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is an end-of-support warning when using ``torch_xla.core.xla_model.mark_step()``. Switch to ``torch_xla.sync()`` instead.

Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.6 release, please downcast your model directly"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

``AttributeError: module 'torch' has no attribute 'xla'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training scripts might fail during activation checkpointing with the error shown below.

.. code::

   AttributeError: module 'torch' has no attribute 'xla'

The solution is to use ``torch_xla.utils.checkpoint.checkpoint`` instead of ``torch.utils.checkpoint.checkpoint`` as the checkpoint function while wrapping pytorch modules for activation checkpointing. Refer to the pytorch/xla discussion regarding this `issue `_.
Also set ``use_reentrant=True`` while calling the torch_xla checkpoint function. Failure to do so will lead to the ``XLA currently does not support use_reentrant==False`` error. For more details on checkpointing, refer to the `documentation `_.

Error ``Attempted to access the data pointer on an invalid python storage`` when using HF Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While using the HuggingFace Transformers Trainer API to train (e.g., the :ref:`HuggingFace Trainer API fine-tuning tutorial`), you may see the error "Attempted to access the data pointer on an invalid python storage". This is a known `issue `_ and has been fixed in version ``4.37.3`` of HuggingFace Transformers.

``Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` error during Neuron Parallel Compile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running Neuron Parallel Compile with the HF Trainer API, you may see the errors ``Status: INVALID_ARGUMENT: Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` in Accelerator's ``pad_across_processes`` function. This is due to data-dependent operations in evaluation metrics computation. Data-dependent operations result in undefined behavior with Neuron Parallel Compile trial execution (which executes empty graphs with zero outputs). To work around this error, disable ``compute_metrics`` when ``NEURON_EXTRACT_GRAPHS_ONLY`` is set to 1:

.. code:: python

   compute_metrics=None if os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY") else compute_metrics

Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my models with PyTorch 2.9?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes.

Do I need to update my scripts for PyTorch 2.9?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the :ref:`migration guide `.

What environment variables will be changed with PyTorch NeuronX 2.9?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warnings are shown when used). Switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

What features will be missing with PyTorch NeuronX 2.9?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch NeuronX 2.9 has all of the supported features in PyTorch NeuronX 2.8, with known issues listed above, and unsupported features as listed in :ref:`pytorch_rn`.

Can I use Neuron Distributed libraries with PyTorch NeuronX 2.9?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, NeuronX Distributed libraries are supported by PyTorch NeuronX 2.9. Transformers NeuronX has reached end-of-support in release 2.26. AWS Neuron Reference for NeMo Megatron has reached end-of-support in release 2.23.

Can I still use PyTorch 2.8 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.8 is supported since release 2.26.

Can I still use PyTorch 2.7 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.7 is supported since release 2.24.

.. note::

   PyTorch NeuronX 2.7 supports Python 3.10 and 3.11. Python 3.12 is not supported for PyTorch 2.7 and earlier versions.

Can I still use PyTorch 2.6 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.6 reached end-of-support in release 2.27.

Can I still use Amazon Linux 2023?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes. You will need to install Python 3.10, 3.11, or 3.12 to use PyTorch NeuronX 2.9.

================================================
FILE: about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-x.rst
================================================

.. _introduce-pytorch-2-5:

Introducing PyTorch 2.5 Support
===============================

.. contents:: Table of contents
   :local:
   :depth: 2

What are we introducing?
------------------------

Starting with the :ref:`Neuron 2.21 ` release, customers will be able to upgrade to PyTorch NeuronX (``torch-neuronx``) supporting ``PyTorch 2.5``. :ref:`setup-torch-neuronx` is updated to include installation instructions for PyTorch NeuronX 2.5 for Amazon Linux 2023 and Ubuntu 22. Note that PyTorch NeuronX 2.5 does not support Python 3.8, which is the default in Ubuntu 20. To use Ubuntu 20, customers will need to install Python 3.9+.

Please review the :ref:`migration guide ` for possible changes to training scripts. No code changes are required for inference scripts.

.. _how-pytorch-2-5-different:

How is PyTorch NeuronX 2.5 different compared to PyTorch NeuronX 2.1?
---------------------------------------------------------------------

PyTorch NeuronX 2.5 uses Torch-XLA 2.5, which has improved support for eager debug mode, Automatic Mixed Precision, PJRT device auto-detection, FP8, and others. See the `Torch-XLA 2.5 release `__ for a full list.

See :ref:`migrate_to_pytorch_2_5` for changes needed to use PyTorch NeuronX 2.5.

.. note::

   GSPMD and Torch Dynamo (torch.compile) support in Neuron will be available in a future release.

.. _install_pytorch_neuron_2_5:

How can I install PyTorch NeuronX 2.5?
--------------------------------------------

To install PyTorch NeuronX 2.5, please follow the :ref:`setup-torch-neuronx` guides for Amazon Linux 2023 and Ubuntu 22 AMI. Please also refer to the Neuron multi-framework DLAMI :ref:`setup guide ` for Ubuntu 22 with a pre-installed virtual environment for PyTorch NeuronX 2.5 that you can use to get started. PyTorch NeuronX 2.5 can be installed using the following:

.. code::

   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.5.* torchvision

.. note::

   PyTorch NeuronX 2.5 is currently available for Python 3.9, 3.10, 3.11.

.. _migrate_to_pytorch_2_5:

Migrate your application to PyTorch 2.5
---------------------------------------

Please make sure you have first installed PyTorch NeuronX 2.5 as described above in the :ref:`installation guide `.

Migrating training scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

To migrate training scripts from PyTorch NeuronX 2.1 to PyTorch NeuronX 2.5, implement the following changes:

.. note::

   ``xm`` below refers to ``torch_xla.core.xla_model`` and ``xr`` refers to ``torch_xla.runtime``.

* The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warning when used). Please switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to convert the model to BF16 format. (see :ref:`migration_from_xla_downcast_bf16`)
* The ``torch_xla.experimental.pjrt`` module, which was replaced by ``torch_xla.runtime`` in Torch-XLA 2.1, has been removed in Torch-XLA 2.5. Users should now utilize the ``torch_xla.runtime`` module as a replacement.
* ``torch_xla.runtime.using_pjrt`` is removed because PJRT is the sole Torch-XLA runtime.
* ``xm.all_reduce`` no longer operates in-place for single tensors.
To fix this, please convert the single tensor to an array (e.g., ``[single_tensor]``) or assign the output of ``xm.all_reduce`` to a variable.

* The functions ``xm.xrt_world_size()``, ``xm.get_ordinal()``, and ``xm.get_local_ordinal()`` are deprecated (warning when used). Please switch to ``xr.world_size``, ``xr.global_ordinal``, and ``xr.local_ordinal`` respectively as replacements.
* ``torch_xla.experimental.xla_sharding`` is now replaced by ``torch_xla.distributed.spmd.xla_sharding``.
* Class ``ZeroRedundancyOptimizer`` now has two new arguments that replace the optional boolean argument ``coalesce_cc``:

  * ``bucket_cap_mb_all_gather`` (int, Optional): Number of MegaBytes of the tensor bucket to fill before doing all-gather. Default: 0 (disable all-gather coalescing).
  * ``bucket_cap_mb_reduce_scatter`` (int, Optional): Number of MegaBytes of the tensor bucket to fill before doing reduce-scatter. Default: 0 (disable reduce-scatter coalescing).

Migrating inference scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no code changes required in the inference scripts.

Troubleshooting and Known Issues
--------------------------------

Neuronx-Distributed Training Llama 3.1 70B 8-node tutorial failed with OSError when the Neuron Cache is placed on FSx mount
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, the Neuronx-Distributed Training Llama 3.1 70B 8-node tutorial fails with an OSError (Errno 61) when the Neuron Cache is placed on an FSx mount:

.. code:: bash

   [rank197]: RuntimeError: Bad StatusOr access: INVALID_ARGUMENT: RunNeuronCCImpl: error condition !(error != 400): : [Errno 61] No data available: '/fsxl/neuron_cache/neuronxcc-2.16.372.0+4a9b2326/MODULE_3540044791706521849+4eb52b03/model.neff' -> '/tmp/tmpx7bvfpmm/model.neff'

We found that the error is due to FSx failing during file copy when there are multiple readers (13 workers fail to copy out of 256). This issue doesn't affect simpler models like BERT. To work around the issue, please use the shared NFS mount (/home directory on a Parallel Cluster) instead of FSx to store the Neuron Cache. This will be fixed in an upcoming release.

Running in-place update operations (e.g. all_reduce) on 0-dimensional tensors result in buffer aliasing errors in torch 2.5 and earlier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Torch's lazy tensor core has a feature where 0-dimensional tensors are stored in a device cache, so scalar constant values can be transferred once and then reused. The values in the device cache are supposed to be marked read-only and never participate in parameter aliasing. However, due to a bug in torch-xla 2.5 (`#8499 `_), sometimes the read-only flag can be dropped, allowing these tensors to be donated, resulting in aliasing errors later when the cached value is used again. A work-around is to avoid using 0-dimensional tensors by changing them to be a 1-D tensor of length 1 (`example `_). If modifying library code is not possible, disable XLA parameter aliasing by setting the environment variable ``XLA_ENABLE_PARAM_ALIASING=0``.

Tensor split on second dimension of 2D array not working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, when using the tensor split operation on a 2D array in the second dimension, the resulting tensors don't have the expected data (https://github.com/pytorch/xla/issues/8640).
The work-around is to set ``XLA_DISABLE_FUNCTIONALIZATION=0``. Another work-around is to use ``torch.tensor_split``.

Import torch_xla crashed with ``TypeError: must be called with a dataclass type or instance`` with torch-xla 2.5 and torch 2.5.1+cpu (CPU flavor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using torch 2.5.1+cpu (CPU flavor) on Python 3.10, importing torch_xla crashes with ``TypeError: must be called with a dataclass type or instance`` due to the installed triton version 3.2.0 (https://github.com/pytorch/xla/issues/8560). To work around this, please remove the installed triton package, downgrade to triton==3.1.0, or use the regular torch 2.5.1 (GPU flavor).

Certain sequence of operations with ``xm.save()`` could corrupt tensors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using the ``xm.save`` function to save tensors, please use ``xm.mark_step()`` before ``xm.save`` to avoid the error described in https://github.com/pytorch/xla/issues/8422, where parameter aliasing could corrupt other tensor values. This issue will be fixed in a future release. (Here ``xm`` is ``torch_xla.core.xla_model``, following the PyTorch/XLA convention.)

Lower BERT pretraining performance when switching to ``model.to(torch.bfloat16)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, BERT pretraining performance is ~11% lower when switching to ``model.to(torch.bfloat16)`` as part of migration away from the deprecated environment variable ``XLA_DOWNCAST_BF16`` due to https://github.com/pytorch/xla/issues/8545. As a work-around to recover the performance, you can set ``XLA_DOWNCAST_BF16=1``, which would still work in torch-neuronx 2.5 and 2.6, although there will be end-of-support warnings (as noted below).

Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.5 release, please downcast your model directly"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warning when used). Please switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16. (see :ref:`migration_from_xla_downcast_bf16`)

WARNING:root:torch_xla.core.xla_model.xrt_world_size() will be removed in release 2.7. is deprecated. Use torch_xla.runtime.world_size instead.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that ``torch_xla.core.xla_model.xrt_world_size()`` will be removed in a future release. Please switch to using ``torch_xla.runtime.world_size`` instead.

WARNING:torch_xla.core.xla_model.xla_model.get_ordinal() will be removed in release 2.7. is deprecated. Use torch_xla.runtime.global_ordinal instead.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that ``torch_xla.core.xla_model.get_ordinal()`` will be removed in a future release. Please switch to using ``torch_xla.runtime.global_ordinal`` instead.
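A minimal before/after sketch of these runtime API replacements:

.. code:: python

   import torch_xla.core.xla_model as xm
   import torch_xla.runtime as xr

   # Deprecated (warnings in Torch-XLA 2.5, removed in later releases):
   # world_size = xm.xrt_world_size()
   # rank       = xm.get_ordinal()
   # local_rank = xm.get_local_ordinal()

   # Replacements:
   world_size = xr.world_size()
   rank       = xr.global_ordinal()
   local_rank = xr.local_ordinal()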
AttributeError: module 'torch_xla.runtime' has no attribute 'using_pjrt'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In Torch-XLA 2.5, ``torch_xla.runtime.using_pjrt`` is removed because PJRT is the sole Torch-XLA runtime. See the `commit PR `__.

Socket Error: Socket failed to bind
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.5, both torchrun and ``init_process_group`` need a socket available to bind to. By default, both select unused ports. If you set the ``MASTER_PORT`` environment variable to a port that is already in use, this error may occur:

.. code::

   [W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
   [W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
   [E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
   RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

To resolve the issue, if you are setting ``MASTER_PORT``, please ensure that the port you set it to is not used anywhere else in your scripts. Otherwise, you can leave ``MASTER_PORT`` unset, and torchrun will set the default port for you.

``AttributeError: module 'torch' has no attribute 'xla'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.5, training scripts might fail during activation checkpointing with the error shown below.

.. code::

   AttributeError: module 'torch' has no attribute 'xla'

The solution is to use ``torch_xla.utils.checkpoint.checkpoint`` instead of ``torch.utils.checkpoint.checkpoint`` as the checkpoint function while wrapping pytorch modules for activation checkpointing. Refer to the pytorch/xla discussion regarding this `issue `_. Also set ``use_reentrant=True`` while calling the torch_xla checkpoint function. Failure to do so will lead to the ``XLA currently does not support use_reentrant==False`` error. For more details on checkpointing, refer to the `documentation `_.

Error ``Attempted to access the data pointer on an invalid python storage`` when using HF Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While using the HuggingFace Transformers Trainer API to train (e.g., the :ref:`HuggingFace Trainer API fine-tuning tutorial`), you may see the error "Attempted to access the data pointer on an invalid python storage". This is a known `issue `_ and has been fixed in version ``4.37.3`` of HuggingFace Transformers.

``ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory`` on Amazon Linux 2023
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

torch-xla version 2.5+ now requires the ``libcrypt.so.1`` shared library. Currently, Amazon Linux 2023 includes the ``libcrypt.so.2`` shared library by default, so you may see ``ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory`` when using torch-neuronx 2.1+ on Amazon Linux 2023. To install ``libcrypt.so.1`` on Amazon Linux 2023, please run the following installation command (see also https://github.com/amazonlinux/amazon-linux-2023/issues/182 for more context):
.. code::

   sudo dnf install libxcrypt-compat

``FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'`` Failure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In PyTorch 2.5, users might face the error shown below due to incompatible ``libneuronxla`` and ``torch-neuronx`` versions being installed.

.. code::

   FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'

Check that the version of ``libneuronxla`` that supports PyTorch NeuronX 2.5 is ``2.1.*``. If not, then uninstall ``libneuronxla`` using ``pip uninstall libneuronxla`` and then reinstall the packages following the :ref:`installation guide `.

GlibC error on Amazon Linux 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If using Torch-NeuronX 2.5 on Amazon Linux 2, you will see the GlibC error below. Please switch to a newer supported OS such as Ubuntu 22 or Amazon Linux 2023.

.. code:: bash

   ImportError: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /tmp/debug/_XLAC.cpython-38-x86_64-linux-gnu.so)

``Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` error during Neuron Parallel Compile
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running Neuron Parallel Compile with the HF Trainer API, you may see the error ``Status: INVALID_ARGUMENT: Input dimension should be either 1 or equal to the output dimension it is broadcasting into`` or ``IndexError: index out of range`` in Accelerator's ``pad_across_processes`` function. This is due to data-dependent operations in evaluation metrics computation. Data-dependent operations would result in undefined behavior with Neuron Parallel Compile trial execution (which executes empty graphs with zero outputs). To work around this error, please disable ``compute_metrics`` when ``NEURON_EXTRACT_GRAPHS_ONLY`` is set to 1:

.. code:: python

   compute_metrics=None if os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY") else compute_metrics

Compiler assertion error when running Stable Diffusion training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, with PyTorch 2.5 (torch-neuronx), we are seeing the following compiler assertion error with Stable Diffusion training when gradient accumulation is enabled. This will be fixed in an upcoming release. For now, if you would like to run Stable Diffusion training with Neuron SDK release 2.21/2.22, please disable gradient accumulation in torch-neuronx 2.5.

.. code:: bash

   ERROR 222163 [NeuronAssert]: Assertion failure in usr/lib/python3.9/concurrent/futures/process.py at line 239 with exception: too many partition dims! {{0,+,960}[10],+,10560}[10]

Frequently Asked Questions (FAQ)
--------------------------------

Do I need to recompile my models with PyTorch 2.5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes.

Do I need to update my scripts for PyTorch 2.5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please see the :ref:`migration guide `.

What environment variables will be changed with PyTorch NeuronX 2.5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (warning when used). Please switch to automatic mixed-precision or use the ``model.to(torch.bfloat16)`` command to cast the model to BF16.
(see :ref:`migration_from_xla_downcast_bf16`)

What features will be missing with PyTorch NeuronX 2.5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch NeuronX 2.5 now has most of the supported features in PyTorch NeuronX 2.1, with known issues listed above, and unsupported features as listed in :ref:`pytorch_rn`.

Can I use Neuron Distributed and Transformers Neuron libraries with PyTorch NeuronX 2.5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, the NeuronX Distributed, Transformers NeuronX, and AWS Neuron Reference for NeMo Megatron libraries will work with PyTorch NeuronX 2.5.

Can I still use PyTorch 2.1 version?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch 2.1 is supported for release 2.21 and will reach end-of-life in a future release. Additionally, the CVEs `CVE-2024-31583 `_ and `CVE-2024-31580 `_ affect PyTorch versions 2.1 and earlier. We recommend upgrading to the new version of Torch-NeuronX by following :ref:`setup-torch-neuronx`.

================================================
FILE: about-neuron/appnotes/torch-neuronx/migration-from-xla-downcast-bf16.rst
================================================

.. _migration_from_xla_downcast_bf16:

Migration From ``XLA_USE_BF16``/``XLA_DOWNCAST_BF16``
=====================================================

Introduction
------------

The environment variables ``XLA_USE_BF16`` and ``XLA_DOWNCAST_BF16`` were created to provide an easy cast-to-BF16 option before automatic mixed precision and ``model.to(torch.bfloat16)`` became available in Torch-XLA. Now that both automatic mixed precision and ``model.to(torch.bfloat16)`` are available in Torch-XLA, ``XLA_USE_BF16`` and ``XLA_DOWNCAST_BF16`` are redundant and can be replaced with these options, which provide an experience familiar from other platforms such as CPUs and GPUs. Using them in Torch-XLA 2.5+ causes end-of-support warnings to be displayed. While they are still functional, their functionality will be removed in a future release (Torch-XLA 2.8), so the recommended changes below serve as replacements. NeuronX Distributed Training has been updated to use some of the options below. Please see :ref:`standard_mixed_precision` for more information.

The changes recommended below can best be made to scripts running with Torch-XLA 2.5+. The same recommendations are also available in :ref:`pytorch-neuronx-programming-guide`.

.. note::

   This guide recommends the options below as replacements for ``XLA_USE_BF16`` and ``XLA_DOWNCAST_BF16``. Do not set ``XLA_USE_BF16=1`` or ``XLA_DOWNCAST_BF16=1`` when using the options below on Neuron devices. Using them will override the per-operator precision settings provided by the options and thus cause more operators to execute in bfloat16.

Full BF16 with stochastic rounding enabled
------------------------------------------

Previously, on torch-neuronx 2.1 and earlier, the environment variables ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` provided full casting to BF16 with stochastic rounding enabled by default. These environment variables are deprecated in torch-neuronx 2.5, although they are still functional with warnings. To replace ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` with stochastic rounding on Neuron, set ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1`` and use the ``torch.nn.Module.to`` method to cast model floating-point parameters and buffers to data-type BF16 as follows:
.. code:: python

   os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "1"

   # model is created
   model.to(torch.bfloat16)

Stochastic rounding is needed to enable faster convergence for a full BF16 model. If the loss is to be kept in FP32, initialize it with ``dtype=torch.float`` as follows:

.. code:: python

   running_loss = torch.zeros(1, dtype=torch.float).to(device)

Similarly, if the optimizer states are to be kept in FP32, convert the gradients to FP32 before optimizer computations:

.. code:: python

   grad = p.grad.data.float()

For a full example, please see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) `, which has been updated to use ``torch.nn.Module.to`` instead of ``XLA_DOWNCAST_BF16``.

BF16 in GPU-compatible mode without stochastic rounding enabled
---------------------------------------------------------------

Full BF16 training in GPU-compatible mode enables faster convergence without the need for stochastic rounding, but requires an FP32 copy of the weights/parameters to be saved and used in the optimizer. To enable BF16 in GPU-compatible mode without stochastic rounding enabled, use the ``torch.nn.Module.to`` method to cast model floating-point parameters and buffers to data-type bfloat16 as follows, without setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1``:

.. code:: python

   # model is created
   model.to(torch.bfloat16)

In the initializer of the optimizer, for example AdamW, you can add code like the following snippet to make an FP32 copy of the weights:

.. code:: python

   # keep a copy of weights in high precision
   self.param_groups_highprec = []
   for group in self.param_groups:
       params = group['params']
       param_groups_highprec = [p.data.float() for p in params]
       self.param_groups_highprec.append({'params': param_groups_highprec})

From then on, you can use the usual gradients but update the FP32 copy of the weights instead:

.. code:: python

   for group, group_highprec in zip(self.param_groups, self.param_groups_highprec):
       for p, p_highprec in zip(group['params'], group_highprec['params']):
           # convert gradients to FP32 before computing the exponential average
           grad = p.grad.data.float()
           # compute the exponential average and denominator using grad
           ...
           # update the FP32 copy of the weights
           p_highprec.data.addcdiv_(exponential_avg, denominator, value=-step_size)

In the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) `, this mode can be enabled by passing the ``--optimizer=AdamW_FP32ParamsCopy`` option to ``dp_bert_large_hf_pretrain_hdf5.py`` and setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=0`` (or leaving it unset).

BF16 automatic mixed precision using PyTorch Autocast
-----------------------------------------------------

By default, the compiler automatically casts internal FP32 operations to BF16. You can disable this and allow PyTorch's BF16 automatic mixed precision function (``torch.autocast``) to do the casting of certain operations to operate in BF16. To enable PyTorch's BF16 mixed-precision, first turn off the Neuron compiler auto-cast:

.. code:: python

   os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"

Next, per the recommendation from the official PyTorch `torch.autocast documentation `__, place only the forward-pass of the training step in the ``torch.autocast`` scope with the ``xla`` device type:

.. code:: python

   with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
       # forward pass

The device type is XLA because we are using PyTorch-XLA's autocast backend.
The PyTorch-XLA `autocast mode source code `_ lists which operations are cast to lower-precision BF16 ("lower precision fp cast policy" section), which are maintained in FP32 ("fp32 cast policy"), and which are promoted to the widest input types ("promote" section).

.. note::

   If an operation is not part of any policy in the `autocast mode source code `_, the data type of the inputs will be used for the computation of the operation.

Example showing the original training code snippet:

.. code:: python

   def train_loop_fn(train_loader):
       for i, data in enumerate(train_loader):
           inputs = data[0]
           labels = data[3]
           outputs = model(inputs, labels=labels)
           loss = outputs.loss / flags.grad_acc_steps
           loss.backward()
           optimizer.step()
           xm.mark_step()

The following shows the training loop modified to use BF16 autocast:

.. code:: python

   os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"

   def train_loop_fn(train_loader):
       for i, data in enumerate(train_loader):
           torch.cuda.is_bf16_supported = lambda: True
           with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
               inputs = data[0]
               labels = data[3]
               outputs = model(inputs, labels=labels)
               loss = outputs.loss / flags.grad_acc_steps
           loss.backward()
           optimizer.step()
           xm.mark_step()

For a full example of BF16 mixed-precision, see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) `. See the official PyTorch documentation for more details about `torch.autocast `__.

================================================
FILE: about-neuron/appnotes/torch-neuronx/torch-neuronx-dataparallel-app-note.rst
================================================

.. _torch-neuronx-dataparallel-app-note:

Data Parallel Inference on torch_neuronx
=========================================

.. contents:: Table of Contents
   :local:
   :depth: 2

Introduction
------------

This guide introduces :func:`torch_neuronx.DataParallel`, a Python API that implements data parallelism on :class:`~torch.jit.ScriptModule` models created by the :ref:`torch_neuronx_trace_api`. The following sections explain how data parallelism can improve the performance of inference workloads on Inferentia, including how :func:`torch_neuronx.DataParallel` uses dynamic batching to run inference on variable input sizes. It covers an overview of the :func:`torch_neuronx.DataParallel` module and provides a few :ref:`example data parallel applications `.

Data parallel inference
-----------------------

Data parallelism is a form of parallelization across multiple devices or cores, referred to as nodes. Each node contains the same model and parameters, but data is distributed across the different nodes. By distributing the data across multiple nodes, data parallelism reduces the total execution time of large batch size inputs compared to sequential execution. Data parallelism works best for smaller models in latency-sensitive applications that have large batch size requirements.

torch_neuronx.DataParallel
--------------------------

To fully leverage the Inferentia hardware, we want to use all available NeuronCores. An inf2.xlarge and inf2.8xlarge have two NeuronCores, an inf2.24xlarge has 12 NeuronCores, and an inf2.48xlarge has 24 NeuronCores. For maximum performance on Inferentia hardware, we can use :func:`torch_neuronx.DataParallel` to utilize all available NeuronCores. :func:`torch_neuronx.DataParallel` implements data parallelism at the module level by replicating the Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference.
This function is analogous to :class:`~torch.nn.DataParallel` in PyTorch. :func:`torch_neuronx.DataParallel` requires PyTorch >= 1.8. The following sections provide an overview of some of the features of :func:`torch_neuronx.DataParallel` that enable maximum performance on Inferentia.

NeuronCore selection
^^^^^^^^^^^^^^^^^^^^

By default, DataParallel will try to use all NeuronCores allocated to the current process to fully saturate the Inferentia hardware for maximum performance. It is more efficient to make the batch dimension divisible by the number of NeuronCores. This will ensure that NeuronCores are not left idle during parallel inference and the Inferentia hardware is fully utilized.

In some applications, it is advantageous to use a subset of the available NeuronCores for DataParallel inference. DataParallel has a ``device_ids`` argument that accepts a list of :obj:`int` or ``'nc:#'`` that specify the NeuronCores to use for parallelization. See :ref:`Specifying NeuronCores ` for an example of how to use the ``device_ids`` argument.

Batch dim
^^^^^^^^^

DataParallel accepts a ``dim`` argument that denotes the batch dimension used to split the input data for distributed inference. By default, DataParallel splits the inputs on ``dim = 0`` if the ``dim`` argument is not specified. For applications with a non-zero batch dim, the ``dim`` argument can be used to specify the inference-time input batch dimension. :ref:`DataParallel with dim != 0 ` provides an example of data parallel inference on inputs with batch dim = 2.

.. _dynamic_batching_description_torch_neuronx:

Dynamic batching
^^^^^^^^^^^^^^^^

Batch size has a direct impact on model performance. The Inferentia chip is optimized to run with small batch sizes. This means that a Neuron compiled model can outperform a GPU model, even when running single-digit batch sizes. As a general best practice, we recommend optimizing your model's throughput by compiling the model with a small batch size and gradually increasing it to find the peak throughput on Inferentia.

Dynamic batching is a feature that allows you to use tensor batch sizes that the Neuron model was not originally compiled against. This is necessary because the underlying Inferentia hardware will always execute inferences with the batch size used during compilation. Fixed batch size execution allows tuning the input batch size for optimal performance. For example, batch size 1 may be best suited for an ultra-low latency on-demand inference application, while batch size > 1 can be used to maximize throughput for offline inferencing. Dynamic batching is implemented by slicing large input tensors into chunks that match the batch size used during the :func:`torch_neuronx.trace` compilation call.

The :func:`torch_neuronx.DataParallel` class automatically enables dynamic batching on eligible models. This allows us to run inference in applications that have inputs with a variable batch size without needing to recompile the model. See :ref:`Dynamic batching ` for an example of how DataParallel can be used to run inference on inputs with a dynamic batch size without needing to recompile the model.

Dynamic batching using small batch sizes can result in sub-optimal throughput because it involves slicing tensors into chunks and iteratively sending data to the hardware. Using a larger batch size at compilation time can use the Inferentia hardware more efficiently in order to maximize throughput.
You can test the tradeoff between individual request latency and total throughput by fine-tuning the input batch size. Dynamic batching in the DataParallel module can be disabled using the ``disable_dynamic_batching()`` function as follows: .. code-block:: python >>> model_parallel = torch_neuronx.DataParallel(model_neuron) >>> model_parallel.disable_dynamic_batching() If dynamic batching is disabled, the compile-time batch size must be equal to the inference-time batch size divided by the number of NeuronCores. :ref:`DataParallel with dim != 0 ` and :ref:`Dynamic batching disabled ` provide examples of running DataParallel inference with dynamic batching disabled. Performance optimizations ^^^^^^^^^^^^^^^^^^^^^^^^^ The DataParallel module has a ``num_workers`` attribute that can be used to specify the number of worker threads used for multithreaded inference. By default, ``num_workers = 2 * number of NeuronCores``. This value can be fine-tuned to optimize DataParallel performance. DataParallel has a ``split_size`` attribute that dictates the size of the input chunks that are distributed to each NeuronCore. By default, ``split_size = max(1, input.shape[dim] // number of NeuronCores)``. This value can be modified to optimally match the inference input chunk size with the compile-time batch size. .. _data_parallel_examples_torch_neuronx: Examples -------- The following sections provide example usages of the :func:`torch_neuronx.DataParallel` module. .. _dataparallel_example_default_torch_neuronx: Default usage ^^^^^^^^^^^^^ .. include:: /frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-default.rst .. _dataparallel_example_specify_ncs_torch_neuronx: Specifying NeuronCores ^^^^^^^^^^^^^^^^^^^^^^ .. include:: /frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-specify-ncs.rst .. _dataparallel_example_dim_neq_zero_torch_neuronx: DataParallel with dim != 0 ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-dim-neq-zero.rst .. _dataparallel_example_dynamic_batching_torch_neuronx: Dynamic batching ^^^^^^^^^^^^^^^^ .. include:: /frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-dynamic-batching.rst .. _dataparallel_example_disable_dynamic_batching_torch_neuronx: Dynamic batching disabled ^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-disable-dynamic-batching.rst ================================================ FILE: about-neuron/appnotes/torch-neuronx/torch-neuronx-graph-partitioner-app-note.rst ================================================ .. _torch-neuronx-graph-partitioner-app-note: Graph Partitioner on torch_neuronx ================================== .. contents:: Table of Contents :local: :depth: 2 Introduction ------------ This guide introduces the graph partitioner for torch-neuronx. The following sections explain the purpose of the graph partitioner, how it works, and go over a few examples. The Purpose of the Graph Partitioner ------------------------------------ While ``neuronx-cc`` is very sophisticated and can compile most operators, there are some operator configurations that are not supported by the compiler. In a model that contains unsupported operators, these usually make up only a small fraction of the graph, while the rest of the model can still benefit from the acceleration that Neuron offers.
With this in mind, we developed a graph partitioner that will partition out unsupported operators to be executed on CPU, while compiling and executing the supported operators on Neuron. How it Works ------------ Determining Unsupported Operators ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Operator support is determined by the ``neuronx-cc`` compiler frontend. Querying the compiler gives us more flexibility than a static list, because a specific operator configuration may be supported while another configuration is not. For example, the square root operator is supported, but not with a ``C64`` data type. To check operator support, we use the :func:`torch_neuronx.analyze` API, which queries the compiler for the device placement (Neuron or CPU) of each operator, giving the graph partitioner a base graph to start partitioning from. The below image shows the flow of the graph partitioner: |torch-neuronx-graph-partitioner-flow-diagram| .. |torch-neuronx-graph-partitioner-flow-diagram| image:: /images/torch-neuronx-graph-partitioner-flow-diagram.png Customizability ^^^^^^^^^^^^^^^ The graph partitioner has a wide range of customizability for a variety of situations. The customization options include: 1. **Minimum Operator Support:** Only partition the model if a minimum percentage of operators are supported. 2. **Minimum Subgraph Size:** The minimum number of operators in any given subgraph. This can be useful if having compute chokepoints with single operator subgraphs is not desired. 3. **Maximum Subgraph Count:** The maximum number of subgraphs. Too many subgraphs can fragment the computation graph, causing performance degradation. 4. **Ops to Partition:** Additional operators to partition to CPU beyond the unsupported operators. This can be useful for guiding the graph partitioner toward a more balanced graph. Furthermore, compiler flags/args can be passed into all Neuron subgraphs through the graph partitioner. For the API reference, visit :func:`torch_neuronx.trace` and :class:`torch_neuronx.PartitionerConfig`. .. note:: Dynamic batching is supported case-by-case with partitioned models, because it depends heavily on what the final partition scheme looks like. Examples -------- The following sections provide example usages of the graph partitioner. Default Usage ^^^^^^^^^^^^^ The below model is a simple MLP model with sorted log softmax output. The sort operator, ``torch.sort()`` or ``aten::sort``, is not supported by ``neuronx-cc`` at this time, so the graph partitioner will partition out the sort operator to CPU.
.. code-block:: python import torch import torch_neuronx import torch.nn as nn import logging # adjust logger level to see what the partitioner is doing logger = logging.getLogger("Neuron") class MLP(nn.Module): def __init__( self, input_size=28 * 28, output_size=10, layers=[4096, 2048] ): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) self.relu = nn.ReLU() def forward(self, x): f1 = self.fc1(x) r1 = self.relu(f1) f2 = self.fc2(r1) r2 = self.relu(f2) f3 = self.fc3(r2) out = torch.log_softmax(f3, dim=1) sort_out, _ = torch.sort(out) return sort_out n = MLP() n.eval() inputs = torch.rand(32, 784) # Configure the graph partitioner with the default values partitioner_config = torch_neuronx.PartitionerConfig() # Trace a neural network with graph partitioner enabled neuron_net = torch_neuronx.trace(n, inputs, partitioner_config=partitioner_config) # Run inference on the partitioned model output = neuron_net(inputs) Specifying requirements ^^^^^^^^^^^^^^^^^^^^^^^ This example is very similar to the previous one, but with two differences. First, the unsupported sort operator is sandwiched between the ReLU activation after the first linear layer and the second linear layer. Second, we specify a max subgraph count of 2. .. code-block:: python import torch import torch_neuronx import torch.nn as nn import logging # adjust logger level to see what the partitioner is doing logger = logging.getLogger("Neuron") class MLP(nn.Module): def __init__( self, input_size=28 * 28, output_size=10, layers=[4096, 2048] ): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) self.relu = nn.ReLU() def forward(self, x): f1 = self.fc1(x) r1 = self.relu(f1) sort_r1, _ = torch.sort(r1) f2 = self.fc2(sort_r1) r2 = self.relu(f2) f3 = self.fc3(r2) out = torch.log_softmax(f3, dim=1) return out n = MLP() n.eval() inputs = torch.rand(32, 784) # Configure the graph partitioner with a maximum subgraph count of 2 partitioner_config = torch_neuronx.PartitionerConfig(max_subgraph_count=2) # This trace will fail since the max_subgraph_count requirement can't be satisfied by the graph partitioner neuron_net = torch_neuronx.trace(n, inputs, partitioner_config=partitioner_config) Output: .. code-block:: ValueError: The partitioner has found 3 subgraphs which exceeds the specified max subgraph count of 2. This example fails because the sort operator placement generates 3 subgraphs, which exceeds the specified maximum of 2. Specifying additional operators to partition ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This example shows a situation where we want to partition out the log_softmax operator despite it being supported. We also specify an 80% support percentage threshold.
.. code-block:: python import torch import torch_neuronx import torch.nn as nn import logging # adjust logger level to see what the partitioner is doing logger = logging.getLogger("Neuron") logger.setLevel(logging.INFO) class MLP(nn.Module): def __init__( self, input_size=28 * 28, output_size=10, layers=[4096, 2048] ): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) self.relu = nn.ReLU() def forward(self, x): f1 = self.fc1(x) r1 = self.relu(f1) f2 = self.fc2(r1) r2 = self.relu(f2) f3 = self.fc3(r2) out = torch.log_softmax(f3, dim=1) sort_out, _ = torch.sort(out) return sort_out n = MLP() n.eval() inputs = torch.rand(32, 784) # Configure the graph partitioner with an 80% support threshold, and also partition out aten::log_softmax partitioner_config = torch_neuronx.PartitionerConfig(min_operator_percentage_threshold=0.8, ops_to_partition=set(["aten::log_softmax"])) # This trace succeeds neuron_net = torch_neuronx.trace(n, inputs, partitioner_config=partitioner_config) Key Output logs: .. code-block:: ... Neuron: The following operations are currently supported: Neuron: aten::linear Neuron: aten::relu Neuron: aten::log_softmax Neuron: The following operations are currently not supported: Neuron: aten::sort, unsup.py(28): ... Neuron: 85.71% of arithmetic operations (6 of 7) are supported Neuron: Num Partitions: 2 Neuron: Creating Partition #1 for device: Device.NEURON Neuron: The following operators will be included in this partition: Neuron: prim::GetAttr:9 Neuron: aten::linear:3 Neuron: aten::relu:2 ... Neuron: Creating Partition #2 for device: Device.CPU Neuron: The following operators will be included in this partition: Neuron: prim::Constant:4 Neuron: aten::sort:1 Neuron: aten::log_softmax:1 Notice that ``aten::log_softmax`` is still reported as supported, but it is placed in Partition #2, which is for ``Device.CPU``. ================================================ FILE: about-neuron/appnotes/transformers-neuronx/generative-llm-inference-with-neuron.rst ================================================ .. _neuron_llm_inference: Generative LLM inference with Neuron ==================================== .. contents:: Table of contents :local: :depth: 2 Background ---------- Large Language Models (LLMs) generate human-like text through a process known as generative inference. Fundamentally, given an input prompt, generative LLM inference generates text outputs by iteratively predicting the next token in a sequence. These models typically take a sequence of integers as input, which represent a sequence of tokens (words/subwords), and generate a prediction for the next token to be emitted. Below is a simple example that illustrates this in code: .. code-block:: python import numpy as np # Vocabulary of tokens the model can parse. The position of each token in the # vocabulary is used as the token_id (an integer representing that token) vocab = ["having", "I", "fun", "am", "learning", ".", "Neuron"] # input token_ids: list of integers that represent the input tokens, in this # case: "I", "am", "having", "fun" input_token_ids = [1, 3, 0, 2] # The LLM gets a vector of input token_ids, and generates a probability-distribution # for what the output token_id should be (with a probability score for each token_id # in the vocabulary) output = LLM(input_token_ids) # by taking argmax on the output, we effectively perform a 'greedy sampling' process, # i.e. we choose the token_id with the highest probability.
# Other sampling techniques also exist, e.g. Top-K. By choosing a probabilistic sampling method # we enable the model to generate different outputs when called multiple times with the same input. next_token_id = np.argmax(output) # map the token_id back into an output token next_token = vocab[next_token_id] To generate entire sentences, the application iteratively invokes the LLM to generate the next token's prediction, and at each iteration we append the predicted token back into the input: .. code-block:: python def generate(input_token_ids, n_tokens_to_generate): for _ in range(n_tokens_to_generate): # decode loop output = LLM(input_token_ids) # model forward pass next_token_id = np.argmax(output) # greedy sampling if next_token_id == EOS_TOK_ID: break # stop if the End Of Sentence (EOS) token is generated # append the prediction to the input, and continue to the next out_token input_token_ids.append(int(next_token_id)) return input_token_ids[-n_tokens_to_generate:] # only return generated token_ids input_token_ids = [1, 3] # "I" "am" output_token_ids = generate(input_token_ids, 4) # output_token_ids = [0, 2, 4, 6] output_tokens = [vocab[i] for i in output_token_ids] # "having" "fun" "learning" "Neuron" This process, of predicting a future value (regression) and adding it back into the input (auto), is sometimes referred to as autoregression. For more details, Jay Mody's `GPT in 60 Lines of NumPy `__ is an excellent writeup on GPTs (Generative Pre-trained Transformers). Performance optimizations ------------------------- The sheer size of state-of-the-art LLMs, as well as the sequential nature of text generation, poses multiple challenges for efficient generative LLM deployment. First, the model is typically sharded across multiple devices, in order to fit the model in device memory. This creates communication overhead and complexity among devices. Second, certain deployments have strict application-level latency bounds, thus requiring substantial latency optimizations. This is especially challenging, due to the sequential nature of token-by-token generation. Finally, generating one token at a time often leads to poor device utilization, due to low arithmetic intensity, which can be improved via batching (see :ref:`what_batch_size_to_use`). The Neuron SDK provides several built-in optimizations, allowing you to extract optimal performance when deploying LLM models, including: KV-caching: ^^^^^^^^^^^ The `transformers-neuronx `__ library implements KV-cache optimization, which saves compute resources by reusing previously calculated SelfAttention key-value pairs, instead of recalculating them for each generated token. To illustrate this concept, see the inner workings of the MaskedSelfAttention operator in the figure below. At each token generation step, the Query vector of the single current token is multiplied by the Key vectors of all previous tokens in the sequence to create attention scores, and these scores are further multiplied by the Value vectors of all previous tokens. .. image:: /images/masked-self-attention-operator.png The core idea behind this optimization is that instead of re-computing the Key and Value vectors for all previous tokens at each token generation step, Neuron can perform only incremental computation for the current token and re-use previously computed Key/Value vectors from the KV-cache. The Key/Value vector of the current token is also appended to the KV-cache, for the next token generation step.
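To make the bookkeeping concrete, here is a schematic, framework-agnostic sketch of a KV-cache (plain NumPy, not Neuron code; the shapes and names are illustrative):

.. code-block:: python

   import numpy as np

   d = 64                          # head dimension (illustrative)
   k_cache = np.zeros((0, d))      # cached Key vectors, one row per token
   v_cache = np.zeros((0, d))      # cached Value vectors

   def attend(q, k_new, v_new):
       """One generation step: append the current token's K/V to the cache,
       then attend over the whole cache instead of recomputing past K/V."""
       global k_cache, v_cache
       k_cache = np.vstack([k_cache, k_new])    # incremental update only
       v_cache = np.vstack([v_cache, v_new])
       scores = (k_cache @ q) / np.sqrt(d)      # scores against all cached keys
       weights = np.exp(scores - scores.max())
       weights /= weights.sum()
       return weights @ v_cache                 # weighted sum of cached values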
.. image:: /images/kv-cache-optimization.png Note that the first token in the output sequence is unique in two ways: .. container:: - No KV-cache is available at this point. - Neuron needs to compute the entire KV-cache for all tokens in the input prompt, rather than one incremental KV-cache entry. This means that first-token latency is typically higher than that of the following tokens. Model sharding: ^^^^^^^^^^^^^^^ Neuron enables you to shard the model across devices via Tensor Parallelism, Pipeline Parallelism (coming soon), or a combination of the two (coming soon). Tensor Parallelism shards each layer across multiple devices, enabling you to achieve the optimal latency. Pipeline Parallelism places different layers on different devices and creates a pipeline between them (as the name suggests), and is useful mainly when optimizing throughput and/or cost-per-inference. To find the optimal Tensor/Pipeline parallelism configuration for your model, see the :ref:`model_partitioning` section. Computation/communication overlap: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Neuron compiler automatically fuses Collective Communication primitives (e.g., AllReduce) with the following computation (e.g., GEMM) in the compute graph. This helps minimize any overhead caused by sharding the model across devices. Compact data-types: ^^^^^^^^^^^^^^^^^^^ Neuron supports INT8 and FP8 (coming soon), which can significantly reduce the model's memory bandwidth and capacity requirements. This is especially useful for Generative LLM inference, which is typically memory-bound. Therefore, using a compact data-type can improve the overall LLM inference performance with lower latency and higher throughput. Bucketing: ^^^^^^^^^^ The transformers-neuronx library automatically uses bucketing to process the input prompt and output tokens. Bucketing makes it possible to handle variable sequence lengths, without requiring support for dynamic shapes. Using multiple progressively larger buckets helps minimize the portion of the KV-cache that needs to be read for each token. .. _model_partitioning: Model partitioning ------------------ How many NeuronCores do I need? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Transformer models are typically defined via a hyper-parameter configuration, such as the following: .. code-block:: python { "n_vocab": 50257, # number of tokens in our vocabulary "n_ctx": 2048, # maximum possible sequence length of the input "n_embd": 9216, # embedding dimension (determines the "width" of the network) "n_head": 72, # number of attention heads (n_embd must be divisible by n_head) "n_layer": 64 # number of layers (determines the "depth" of the network) } To determine the number of NeuronCores needed to fit the model, perform the following calculation:

.. code-block:: python

   weight_mem_footprint = 12 x <n_layer> x <n_embd>^2 x <dtype-size>
   KV_cache_mem_footprint = <batch-size> x <n_layer> x <n_ctx> x <n_embd> x 2 x <dtype-size>
   # <dtype-size> is 2 for BF16/FP16, or 1 for FP8/INT8

   mem_footprint = weight_mem_footprint + KV_cache_mem_footprint

And from here, determining the number of NeuronCores is straightforward:

.. code-block:: python

   num_neuron_cores = ceil_to_closest_supported_size(mem_footprint / <mem-per-NeuronCore>, <target-instance-family>)
   # <mem-per-NeuronCore> is 16GiB per Inferentia2/Trainium1 NeuronCore

For example, when running OPT-66B on Inf2, with a batch-size of 16, the number of required NeuronCores can be computed as follows.
 
.. code-block:: python

   # OPT-66B example (BF16, Inf2)
   # n_layer=64, n_ctx=2048, n_embd=9216, batch=16
   weight_mem_footprint = 12 x 64 x 9216^2 x 2 = 121.5 GiB
   KV_cache_mem_footprint = 16 x 64 x 2048 x 9216 x 2 x 2 = 72 GiB
   mem_footprint = 121.5GiB + 72GiB = 193.5 GiB

   num_neuron_cores = ceil_to_closest_supported_size(193.5GiB / 16GiB, Inf2)
                    = ceil_to_closest_supported_size(12.1) = 24
   ## Currently, the Neuron runtime supports tensor-parallelism degrees 2, 8, and 32 on Trn1
   ## and supports tensor-parallelism degrees 2, 4, 8, 12 and 24 on Inf2.

Use the :ref:`neuron_calculator` to compute the number of cores needed for a custom hyper-parameter configuration. Which parallelism technique should I use? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Tensor parallelism improves latency, at the expense of increased intra-layer communication. Thus, as a general rule, it is recommended to use the smallest tensor parallelism degree that meets your latency requirement, and then use pipeline/data parallelism from that point on. If latency is not a major concern in your application (e.g., model evaluation) and the primary goal is to maximize throughput (i.e., minimize total cost per token), then it is most efficient to use pipeline parallelism and increase the batch-size as much as possible. .. _what_batch_size_to_use: What batch-size should I use? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Due to the serial token generation nature of generative LLM inference, this workload tends to be extremely memory bound. This means that throughput (and thus cost per inference) improves significantly with batching. As a general rule, we recommend increasing the batch-size to the maximum amount that fits within the latency budget (up to batch=256; larger batch sizes typically do not improve performance further). Note that the KV-cache grows linearly with the batch-size, and can grow until the device runs out of memory (typically referred to as OOM). If the latency budget allows, we recommend increasing the batch-size to the maximum value that does not result in OOM. Users may also consider pipelining the model beyond what is necessary to fit model parameters / KV-cache on devices, in order to free up device-memory space and thus allow the batch-size to increase without causing OOM issues. ================================================ FILE: about-neuron/arch/glossary.rst ================================================ .. _neuron_hw_glossary: Neuron Glossary =============== .. contents:: Table of contents :local: :depth: 2 Terms ----- Neuron Devices (Accelerated Machine Learning chips) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: auto :header-rows: 1 :align: left * - Term - Description * - .. glossary:: Inferentia - AWS first generation accelerated machine learning chip supporting inference only * - .. glossary:: Trainium/Inferentia2 - AWS second generation accelerated machine learning chip supporting training and inference * - .. glossary:: Trainium2 - AWS third generation accelerated machine learning chip supporting training and inference * - .. glossary:: Neuron Device - Accelerated machine learning chip (e.g. Inferentia or Trainium) Neuron powered Instances ^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: auto :header-rows: 1 :align: left * - Term - Description * - .. glossary:: Inf1 - Inferentia powered accelerated compute EC2 instance * - .. glossary:: Trn1 - Trainium powered accelerated compute EC2 instance * - .. glossary:: Inf2 - Inferentia2 powered accelerated compute EC2 instance * - ..
glossary:: Trn2 - Trainium2 powered accelerated compute EC2 instance NeuronCore terms ^^^^^^^^^^^^^^^^ .. list-table:: :widths: auto :header-rows: 1 :align: left * - Term - Description * - .. glossary:: NeuronCore - The machine learning compute cores within Inferentia/Trainium * - .. glossary:: NeuronCore-v1 - Neuron Core within Inferentia * - .. glossary:: NeuronCore-v2 - Neuron Core within Trainium1/Inferentia2 * - .. glossary:: NeuronCore-v3 - Neuron Core within Trainium2 * - .. glossary:: Tensor Engine - 2D systolic array (within the NeuronCore), used for matrix computations * - .. glossary:: Scalar Engine - A scalar engine within each NeuronCore, which can accelerate element-wise operations (e.g. GELU, ReLU, reciprocal, etc.) * - .. glossary:: Vector Engine - A vector engine within each NeuronCore, which can accelerate spatial operations (e.g. layerNorm, TopK, pooling, etc.) * - .. glossary:: GPSIMD Engine - Embedded general-purpose SIMD cores, within each NeuronCore, to accelerate custom operators * - .. glossary:: Sync Engine - The SP engine, which is integrated inside the NeuronCore. Used for synchronization and DMA triggering. * - .. glossary:: Collective Communication Engine - Dedicated engine for collective communication, which allows overlapping computation and communication * - .. glossary:: High Bandwidth Memory - `High Bandwidth Memory `_, used as device memory for NeuronCore-v2 and beyond. * - .. glossary:: State Buffer - The main software-managed on-chip memory in NeuronCore-v1 and beyond. * - .. glossary:: Partial Sum Buffer - A second software-managed on-chip memory in NeuronCore-v1 and beyond, with near-memory accumulation support for TensorE output data. * - .. glossary:: NeuronLink - Interconnect between NeuronCores * - .. glossary:: NeuronLink-v1 - Interconnect between NeuronCores in the Inferentia device * - .. glossary:: NeuronLink-v2 - Interconnect between NeuronCores in the Trainium1/Inferentia2 device * - .. glossary:: NeuronLink-v3 - Interconnect between NeuronCores in the Trainium2 device Neuron SDK terms ^^^^^^^^^^^^^^^^ .. list-table:: :widths: auto :header-rows: 1 :align: left * - Term - Description * - .. glossary:: Neuron Kernel Interface - A bare-metal language and compiler for directly programming Neuron devices, available on AWS Trainium/Inferentia2 and later devices. Abbreviations ------------- .. list-table:: :widths: auto :header-rows: 1 :align: left * - Abbreviation - Description * - .. glossary:: NxD Core - NeuronX Distributed Core Library * - .. glossary:: NxD Training - NeuronX Distributed Training Library * - .. glossary:: NxD Inference - NeuronX Distributed Inference Library * - .. glossary:: NC - Neuron Core * - .. glossary:: NeuronCore - Neuron Core * - .. glossary:: ND - Neuron Device * - .. glossary:: NeuronDevice - Neuron Device * - .. glossary:: TensorE - Tensor Engine * - .. glossary:: ScalarE - Scalar Engine * - .. glossary:: VectorE - Vector Engine * - .. glossary:: GpSimdE - GpSimd Engine * - .. glossary:: CCE - Collective Communication Engine * - .. glossary:: HBM - High Bandwidth Memory * - .. glossary:: SBUF - State Buffer * - .. glossary:: PSUM - Partial Sum Buffer * - .. glossary:: FP32 - Float32 * - .. glossary:: TF32 - TensorFloat32 * - .. glossary:: FP16 - Float16 * - .. glossary:: BF16 - Bfloat16 * - .. glossary:: cFP8 - Configurable Float8 * - .. glossary:: RNE - Round Nearest Even * - .. glossary:: SR - Stochastic Rounding * - .. glossary:: NKI - Neuron Kernel Interface * - .. glossary:: CustomOps - Custom Operators * - ..
glossary:: RT - Neuron Runtime * - .. glossary:: DP - Data Parallel * - .. glossary:: DPr - Data Parallel degree * - .. glossary:: TP - Tensor Parallel * - .. glossary:: TPr - Tensor Parallel degree * - .. glossary:: PP - Pipeline Parallel * - .. glossary:: PPr - Pipeline Parallel degree ================================================ FILE: about-neuron/arch/index.rst ================================================ .. _neuron-architecture-index: .. meta:: :description: Explore the hardware architecture of AWS Neuron instances, including EC2 Trn and Inf instance types, AWS Inferentia and Trainium chips, and NeuronCore processing units. Learn about system specifications, memory hierarchies, interconnect topologies, and architectural considerations for machine learning workloads. :date-modified: 2025-10-03 AWS Neuron architecture guides ============================== Review and understand the hardware architecture of AWS Neuron instances, including AWS Elastic Compute Cloud (EC2) ``Trn`` and ``Inf`` instance types, AWS Inferentia and Trainium chips, and NeuronCore processing units. The documentation covers system specifications, memory hierarchies, interconnect topologies, and architectural considerations for machine learning workloads. About Neuron Hardware ---------------------- AWS Neuron hardware consists of custom-designed machine learning accelerators optimized for deep learning workloads. This section covers the architecture and capabilities of AWS Inferentia and Trainium chips, their NeuronCore processing units, and the EC2 instances that host them. Trainium Architecture ---------------------- .. grid:: 2 :gutter: 2 .. grid-item-card:: AWS Trainium3 :link: neuron-hardware/trainium3 :link-type: doc :class-body: sphinx-design-class-title-small Third-generation training accelerator chip .. grid-item-card:: AWS Trainium2 :link: neuron-hardware/trainium2 :link-type: doc :class-body: sphinx-design-class-title-small Second-generation training accelerator chip .. grid-item-card:: AWS Trainium :link: neuron-hardware/trainium :link-type: doc :class-body: sphinx-design-class-title-small First-generation training accelerator chip Inferentia Architecture ------------------------ .. grid:: 2 :gutter: 2 .. grid-item-card:: AWS Inferentia2 :link: neuron-hardware/inferentia2 :link-type: doc :class-body: sphinx-design-class-title-small Second-generation inference accelerator chip .. grid-item-card:: AWS Inferentia :link: neuron-hardware/inferentia :link-type: doc :class-body: sphinx-design-class-title-small First-generation inference accelerator chip NeuronCore Architecture ------------------------ NeuronCores are fully independent, heterogeneous compute units that power the Trainium, Trainium2, Inferentia, and Inferentia2 chips. .. grid:: 2 :gutter: 2 .. grid-item-card:: NeuronCore v4 :link: neuron-hardware/neuron-core-v4 :link-type: doc :class-body: sphinx-design-class-title-small Processing unit architecture for Trainium3 .. grid-item-card:: NeuronCore v3 :link: neuron-hardware/neuron-core-v3 :link-type: doc :class-body: sphinx-design-class-title-small Processing unit architecture for Trainium2 .. grid-item-card:: NeuronCore v2 :link: neuron-hardware/neuron-core-v2 :link-type: doc :class-body: sphinx-design-class-title-small Processing unit architecture for Inferentia2 and Trainium ..
grid-item-card:: NeuronCore v1 :link: neuron-hardware/neuron-core-v1 :link-type: doc :class-body: sphinx-design-class-title-small Processing unit architecture for Inferentia Neuron AWS EC2 Platform Architecture ------------------------------------- Overviews of the AWS Inf and Trn instance and UltraServer architectures. .. grid:: 2 :gutter: 2 .. grid-item-card:: Inf1 Architecture :link: neuron-hardware/inf1-arch :link-type: doc :class-body: sphinx-design-class-title-small Inf1 instance architecture and specifications .. grid-item-card:: Inf2 Architecture :link: neuron-hardware/inf2-arch :link-type: doc :class-body: sphinx-design-class-title-small Inf2 instance architecture and specifications .. grid-item-card:: Trn1 Architecture :link: neuron-hardware/trn1-arch :link-type: doc :class-body: sphinx-design-class-title-small Trn1 instance architecture and specifications .. grid-item-card:: Trn2 Architecture :link: neuron-hardware/trn2-arch :link-type: doc :class-body: sphinx-design-class-title-small Trn2 instance architecture and specifications .. grid-item-card:: Trn3 Architecture :link: neuron-hardware/trn3-arch :link-type: doc :class-body: sphinx-design-class-title-small Trn3 instance architecture and specifications .. toctree:: :maxdepth: 1 :hidden: AWS Inferentia AWS Inferentia2 AWS Trainium AWS Trainium2 AWS Trainium3 NeuronCore v1 NeuronCore v2 NeuronCore v3 NeuronCore v4 Inf1 Architecture Inf2 Architecture Trn1 Architecture Trn2 Architecture Trn3 Architecture ================================================ FILE: about-neuron/arch/neuron-features/custom-c++-operators.rst ================================================ .. _feature-custom-c++-operators: Neuron Custom C++ Operators =========================== .. include:: /neuron-customops/customops-intro.txt For more details, see :ref:`neuron_c++customops` ================================================ FILE: about-neuron/arch/neuron-features/data-types.rst ================================================ .. _neuron-data-types: Data Types ========== .. contents:: Table of contents :local: :depth: 2 Introduction ------------ Inferentia and Trainium NeuronDevices include different NeuronCore versions, which support different data-types. This section describes what data-types are supported in each NeuronCore version. NeuronCore v1 Data Types ------------------------ Neuron Data-Types ^^^^^^^^^^^^^^^^^ Neuron enables developers to choose from multiple data-types. The supported data-types are FP32, FP16, and BF16. Developers can train their models on their platform of choice (e.g. EC2 P3 instances), and then easily move their trained models to EC2 Inf1 for execution.
.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Data Type
     - Sign
     - Range (exponent bits)
     - Precision (mantissa bits)
   * - FP32
     - 1
     - 8 bits
     - 23 bits
   * - BF16
     - 1
     - 8 bits
     - 7 bits
   * - FP16
     - 1
     - 5 bits
     - 10 bits
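To relate these layouts to each other, the following illustrative NumPy sketch (ours, not part of the Neuron SDK) shows that a BF16 value is effectively an FP32 value with the low 16 mantissa bits dropped:

.. code-block:: python

   import numpy as np

   x = np.float32(3.14159265)
   bits = np.frombuffer(x.tobytes(), dtype=np.uint32)[0]

   # BF16 keeps FP32's sign bit and 8 exponent bits but only the top 7
   # mantissa bits, so truncating the low 16 bits of an FP32 value yields
   # its (round-toward-zero) BF16 value:
   bf16_bits = np.uint32(bits & 0xFFFF0000)
   bf16_as_fp32 = np.frombuffer(bf16_bits.tobytes(), dtype=np.float32)[0]

   print(x, bf16_as_fp32)  # 3.1415927 3.140625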
FP16/BF16 models ~~~~~~~~~~~~~~~~ Models natively trained in FP16/BF16 will be executed in their trained data-types. This is a straightforward migration from the training platform to Inf1. FP32 models ~~~~~~~~~~~ The Neuron SDK supports **automatic model conversion** from FP32 to BF16 by default. This capability allows developers to train their models using the FP32 format for the highest accuracy, and achieve performance benefits without having to worry about low-precision training (e.g. no need for loss-scaling during training). ML models are typically robust to FP32 to BF16 conversion, with minimal to no impact on accuracy. The conversion accuracy is model dependent; therefore, users are encouraged to benchmark the accuracy of the auto-converted model against the original FP32 trained model. When the compiler is supplied with an unmodified FP32 model input, it will automatically compile the model to run as BF16 on Inferentia. During inference, the FP32 input data will be auto-converted internally by Inferentia to BF16, and the output will be converted back to the FP32 data-type. For explicit FP16 inferencing, either use an FP16 trained model, or use an external tool (like AMP) to make the explicit conversions. .. _neuron-data-types-v2: NeuronCore v2 Data Types ------------------------ The NeuronCore v2 supports the following data types: * 32 and 16-bit Floating Point (FP32 / FP16) * TensorFloat-32 (TF32) * Brain Floating Point (BFloat16) * 8-bit Floating Point with configurable range and precision (cFP8) * Unsigned 8-bit Integer (UINT8) The layout for these is as follows:
.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Data Type
     - Sign
     - Range (exponent bits)
     - Precision (mantissa bits)
   * - FP32
     - 1
     - 8 bits
     - 23 bits
   * - TF32
     - 1
     - 8 bits
     - 10 bits
   * - BF16
     - 1
     - 8 bits
     - 7 bits
   * - FP16
     - 1
     - 5 bits
     - 10 bits
   * - FP8_e5m2
     - 1
     - 5 bits
     - 2 bits
   * - FP8_e4m3
     - 1
     - 4 bits
     - 3 bits
   * - FP8_e3m4
     - 1
     - 3 bits
     - 4 bits
   * - UINT8
     - --
     - --
     - 8 bits (unsigned integer)
Model Type Conversion ^^^^^^^^^^^^^^^^^^^^^ The Neuron SDK supports automatic model conversion from FP32 to BF16 by default. This capability allows developers to train their models using the FP32 format for the highest accuracy, and then achieve run-time performance benefits without having to worry about low-precision training (e.g. no need for loss-scaling during training). ML models are typically robust to FP32 to BF16 conversion, with minimal to no impact on accuracy. Since conversion accuracy is model dependent, users are encouraged to benchmark the accuracy of the auto-converted model against the original FP32 trained model. See :ref:`Mixed Precision and Performance-accuracy Tuning for Training` for more details on supported data types and their properties. The Neuron compiler offers the ``--auto-cast`` and ``--auto-cast-type`` options to specify automatic casting of FP32 tensors to other data types to address performance and accuracy tradeoffs. See the :ref:`Neuron Compiler CLI Reference Guide` for a description of these options. NeuronCore v2 Rounding Modes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Because floating point values are represented by a finite number of bits, they cannot represent all real numbers accurately. Floating point calculations that exceed their defined data type size are rounded. By default, the NeuronCore v2 uses the Round-to-Nearest, ties-to-Even (RNE) algorithm. It also provides a new Stochastic Rounding mode. When Stochastic Rounding is enabled, the hardware will round the floating point value up or down using a proportional probability. This could lead to improved model convergence. Use the ``NEURON_RT_STOCHASTIC_ROUNDING_EN`` environment variable to select a rounding mode. ================================================ FILE: about-neuron/arch/neuron-features/index.rst ================================================ .. _neuron-features-index: Neuron Features =============== Neuron features provide insights into Neuron capabilities that enable high performance and improve the usability of developing and deploying deep learning acceleration on top of Inferentia and Trainium based instances. .. grid:: 2 :gutter: 2 .. grid-item-card:: Custom C++ operators :link: custom-c++-operators :link-type: doc :class-body: sphinx-design-class-title-small Framework for implementing custom operators in C++ to extend Neuron's built-in operation support. .. grid-item-card:: Data types :link: data-types :link-type: doc :class-body: sphinx-design-class-title-small Supported numerical data types including FP32, FP16, BF16, and INT8 for efficient model execution. .. grid-item-card:: Logical NeuronCore configuration :link: logical-neuroncore-config :link-type: doc :class-body: sphinx-design-class-title-small Configuration options for grouping and managing NeuronCores as logical units for workload distribution. .. grid-item-card:: Neuron persistent cache :link: neuron-caching :link-type: doc :class-body: sphinx-design-class-title-small Persistent caching system for compiled models to reduce compilation time across sessions. .. grid-item-card:: NeuronCore batching :link: neuroncore-batching :link-type: doc :class-body: sphinx-design-class-title-small Batching strategies to maximize throughput by processing multiple inputs simultaneously on NeuronCores. .. grid-item-card:: NeuronCore pipeline :link: neuroncore-pipeline :link-type: doc :class-body: sphinx-design-class-title-small Pipeline execution model that overlaps computation and data movement for improved performance. ..
grid-item-card:: Rounding modes :link: rounding-modes :link-type: doc :class-body: sphinx-design-class-title-small Configurable numerical rounding modes for controlling precision and accuracy in computations. .. toctree:: :maxdepth: 1 :hidden: Custom C++ operators Data types Logical NeuronCore configuration Neuron persistent cache NeuronCore batching NeuronCore pipeline Rounding modes ================================================ FILE: about-neuron/arch/neuron-features/logical-neuroncore-config.rst ================================================ .. _logical-neuroncore-config: ################################ Logical NeuronCore configuration ################################ Logical NeuronCore configuration (LNC) is a set of compiler and runtime settings for instances powered by AWS Trainium2 that determines the number of NeuronCores exposed to your machine learning (ML) applications. LNC configuration works by combining the compute and memory resources of multiple physical NeuronCores into a single logical NeuronCore. You can configure these settings to reduce the number of worker processes needed for training and deployment of large-scale models. .. important:: LNC can only be set to **1** or **2**. These are the only supported values. On Trn2, each chip has 8 physical NeuronCores. With LNC=2 (the default), these are grouped into 4 logical NeuronCores. With LNC=1, all 8 physical cores are treated as individual logical NeuronCores. LNC applies only to Trn2 and Trn3 instances. .. contents:: Concepts :depth: 1 :local: :backlinks: none =================== Logical NeuronCores =================== A logical NeuronCore is a grouping of physical NeuronCores that the Neuron Compiler, Neuron Runtime, Neuron Tools, and Frameworks handle as a single unified NeuronCore. Every Trainium2 device contains eight physical NeuronCore-v3. ============================= Compiler and runtime settings ============================= LNC configuration is controlled with the following runtime and compiler settings: | **Neuron Runtime** | The ``NEURON_LOGICAL_NC_CONFIG`` runtime environment variable controls how many physical NeuronCores are grouped to make up a logical NeuronCore. | **Neuron compiler flags** | The ``--logical-nc-config`` or ``-lnc`` command-line options control the degree of model sharding the compiler performs on an input graph. You must compile your models to use the LNC configuration set by the Neuron Runtime environment variable. AWS Neuron currently doesn't support setting the compiler flag to a different LNC configuration than the Neuron Runtime environment variable. ================================= Logical NeuronCore configurations ================================= AWS Neuron supports the following Logical NeuronCore configurations: .. tab-set:: .. tab-item:: LNC = 2 A Logical NeuronCore configuration (LNC) of two is the default setting on Trainium2 devices. It combines two physical NeuronCore-v3 into a logical NeuronCore with the software id ``NC_v3d``. When you set the Logical NeuronCore configuration to two, it directs Trainium2 devices to expose four ``NC_v3d`` to your machine learning applications. On this setting, a ``Trn2.48xlarge`` instance presents 64 available NeuronCores. The following high-level diagram shows a ``Trn2.48xlarge`` instance, connected in a 2D torus topology, with the Logical NeuronCore configuration set to two. .. image:: /images/architecture/Trn2/trn2_lnc2.png :align: center :width: 750 | Trainium2 devices contain four 24GB HBM banks.
Each bank is shared by two physical NeuronCore-v3. When LNC=2, the two physical NeuronCores share a single address space. Workers on each of the two physical NeuronCores can access tensors and perform local collective operations without accessing the network. The following diagram shows how a logical NeuronCore is presented to the software under this configuration. .. image:: /images/architecture/NeuronCore/lnc_2.png :align: center :width: 450 | To set the Logical NeuronCore configuration to two, use the following runtime and compiler flag combination: | **Runtime environment variable:** | ``NEURON_LOGICAL_NC_CONFIG`` = 2 | **Compiler flag:** | ``-lnc`` = 2 | .. tab-item:: LNC = 1 When you set the Logical NeuronCore configuration to one, it assigns each physical NeuronCore-v3 to a single logical NeuronCore with the software id ``NC_v3``. This directs Trainium2 devices to expose eight ``NC_v3`` to your machine learning applications. On this setting, a ``Trn2.48xlarge`` instance presents 128 available NeuronCores. The following high-level diagram shows a ``Trn2.48xlarge`` instance, connected in a 2D torus topology, with the Logical NeuronCore configuration set to one. .. image:: /images/architecture/Trn2/trn2_lnc1.png :align: center :width: 750 | Trainium2 devices contain four 24GB HBM banks. Each bank is shared by two physical NeuronCore-v3. When the Logical NeuronCore configuration is set to one, both physical NeuronCores have access to the entire 24GB HBM bank. The following diagram shows how logical NeuronCores are presented to the software under this configuration. .. image:: /images/architecture/NeuronCore/lnc_1.png :align: center :width: 475 | To set the Logical NeuronCore configuration to one, use the following runtime and compiler flag combination: | **Runtime environment variable:** | ``NEURON_LOGICAL_NC_CONFIG`` = 1 | **Compiler flag:** | ``-lnc`` = 1 | ================================================ FILE: about-neuron/arch/neuron-features/neuron-caching.rst ================================================ .. _neuron-caching: Neuron Persistent Cache ======================= PyTorch Neuron (``torch-neuronx``) uses ``torch-xla``, and ``torch-xla`` operates in lazy mode. In other words, every operation in the training script is recorded in a graph. The graph is executed only when the results are requested by the user, e.g. via ``print`` or ``xm.mark_step``. Requesting results tells ``torch-xla`` that the recorded graph needs to be executed. Before executing the graph on a Neuron device, ``torch-xla`` calls the Neuron Compiler (``neuronx-cc``) to compile the graph into a Neuron-specific graph. Then the graph is executed on the NeuronCores. Compiling the graph involves running optimizations that can make use of the NeuronCores efficiently. Running these optimizations can be expensive and can result in long compile times. To save users from compiling these graphs at every iteration, ``torch-xla`` maintains an in-memory cache called the Just in Time (JIT) cache. When the user re-runs the same graph (e.g. the 2nd iteration of the training run), ``torch-xla`` checks this JIT cache and re-uses the cached compilation result, thereby avoiding the wait times. Since the JIT cache is an in-memory cache, it needs to be constructed every time the training script is run. Hence, if the user re-runs the training script, a new JIT cache is created. This causes a compilation for the first training graph.
To avoid such compilations across training runs, PyTorch Neuron (``torch-neuronx``) has built an on-disk ``Neuron Persistent Cache``. Since this cache is on disk, it is persistent across training runs. Now, when a graph is compiled for the first time, the compilation result is saved in the ``Neuron Persistent Cache``. When the user re-runs the training script, since the JIT cache is not ready, the graph is sent for compilation. PyTorch Neuron (``torch-neuronx``) then checks if the compiled result is present in the ``Neuron Persistent Cache``; if yes, it returns the compiled result. This on-disk cache thereby avoids compilations across training runs. This cache is enabled by default for Neuron's PyTorch/XLA flow (training) as well as the transformers-neuronx LLM inference package. The default cache path is the directory ``/var/tmp/neuron-compile-cache``. The diagram below shows the end-to-end flow: |Image:| As seen from the diagram, the operations are recorded in a graph in lazy mode, and only when a mark_step is hit is the graph executed. Before execution, the graph passes through two caches to check if we have compiled the graph sometime in the past. If yes, we reuse the compilation result and execute with it. This avoids duplicate compilations. Note that the JIT cache and the Neuron Persistent Cache are complementary to each other: the JIT cache prevents duplicate compilations within a run, and the Neuron Persistent Cache prevents duplicate compilations across training runs. For example, within a training script, we have a training loop that iterates through the dataset. The first iteration would trace a unique graph, and the following iterations would trace graphs that are similar to the first one. In this case, the subsequent iterations would hit the JIT cache and reuse the result. However, to save users from compiling the first iteration graph, the ``Neuron Persistent Cache`` would be used. In this case, the very first time the script is run, the ``Neuron Persistent Cache`` would be updated. Going forward, when we re-run the training script, compilation results from the ``Neuron Persistent Cache`` would be used. To better understand how the ``Neuron Persistent Cache`` works, consider the example below: .. code:: python import torch import torch_xla import torch_xla.core.xla_model as xm device = xm.xla_device() t1 = torch.randn(3, 3).to(device) t2 = t1 / 0.5 x = t2.cpu() Running the above example produces the following logs: .. code:: bash 2023-08-25 21:51:36.000433: INFO ||NCC_WRAPPER||: Compile cache path: /var/tmp/neuron-compile-cache . Compiler status PASS Re-running the above script would fetch the graph from the Neuron cache, and you would see logs as follows: .. code:: bash 2023-08-25 21:52:23.000451: INFO ||NCC_WRAPPER||: Compile cache path: /var/tmp/neuron-compile-cache 2023-08-25 21:52:23.000453: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.8.0.25+a3ad0f342/MODULE_198775565831884870+d41d8cd9/model.neff. Exiting with a successfully compiled graph. As you can see, the next run picks the compiled graph from the cache, thereby saving compilation time. The cache uses a hash of the Neuron compiler flags and the XLA graph as the key. If the Neuron compiler version or the XLA graph changes, you will see recompilation.
Examples of changes that would cause an XLA graph change include: - Model type and size - Batch size - Optimizer and optimizer hyperparameters - Location of xm.mark_step() To keep the cache size small and to enable weights/parameters updates without recompilation, only the compute graphs are cached when using transformers-neuronx (weights/parameters are inputs to the compute graphs) and the training flow using torch-neuronx's XLA (weights/parameters are inputs and outputs of the compute graphs). Note that this caching mechanism doesn't apply to the torch-neuronx trace API, where the weights/parameters are frozen and converted to constants, then compiled together with the compute operations (traced graphs with frozen weights/parameters are not cached). All compilation results are saved in the cache. To disable the cache, you can pass the ``--no_cache`` option via NEURON_CC_FLAGS: .. code:: python os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache' The default cache path is the directory ``/var/tmp/neuron-compile-cache``. To change the cache's location, pass the ``cache_dir=<cache URL>`` option via ``NEURON_CC_FLAGS``, or set the ``NEURON_COMPILE_CACHE_URL=<cache URL>`` environment variable: .. code:: python os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --cache_dir=<cache URL>' .. code:: python os.environ['NEURON_COMPILE_CACHE_URL'] = '<cache URL>' The cache URL specified using ``--cache_dir`` is prioritized over that specified using ``NEURON_COMPILE_CACHE_URL`` if both are set. If ``<cache URL>`` starts with ``s3://``, it will use the AWS S3 URL as the cache location, provided that the corresponding S3 bucket exists and is both readable and writable. You can change the verbosity of the compiler by setting ``log_level`` to either ``WARNING``, ``INFO`` or ``ERROR``. This can be done as follows: .. code:: python os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --log_level=INFO' A graph compilation can fail because of a compilation error or an environment issue (for example, compilation is interrupted by ctrl-C). The graph would be marked as failed, and a subsequent rerun would encounter a message like the one below: .. code:: bash INFO ||NCC_WRAPPER||: Got a cached failed neff at /var/tmp/neuron-compile-cache/neuronxcc-2.8.0.25+a3ad0f342/MODULE_12486829708343293975+d41d8cd9/model.neff. Will skip compilation, please set --retry_failed_compilation for recompilation. To retry compilation, add ``--retry_failed_compilation`` to the ``NEURON_CC_FLAGS`` environment variable. When the script is rerun, all the previously failed compilations are recompiled, and fresh results are saved in the cache. .. code:: python os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --retry_failed_compilation' Note that all flags demonstrated above are parsed by a tool called ``neuron_cc_wrapper``, which is a wrapper over the Neuron Compiler CLI that provides the caching mechanism. These flags are not passed on to the Neuron Compiler CLI. .. |Image:| image:: ./images/NeuronCaching.png ================================================ FILE: about-neuron/arch/neuron-features/neuroncore-batching.rst ================================================ .. _neuron-batching: Neuron Batching =============== Batching refers to the process of grouping multiple samples together and processing them as a group (i.e. passing them together through the neural network). Batching is typically used as an optimization for improving throughput at the expense of higher latency (and potentially higher memory footprint).
Batching considerations are slightly different between inference and training workloads, and we thus cover them separately below. .. contents:: Table of contents :local: :depth: 2 Batching in inference workloads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ What is batched inference? ^^^^^^^^^^^^^^^^^^^^^^^^^^ Batched inference is illustrated conceptually below, with a single NeuronCore performing batched computation of a 3-layer neural network with a batch-size of 4. The NeuronCore reads the parameters for a certain layer from the external memory, and then performs the corresponding computations for all 4 inference-requests, before reading the next set of parameters (thus, performing more compute for every parameter read from memory). .. image:: /images/batched-inference.png What are the benefits of batched inference? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For inference, batching is typically used as a trade-off knob between throughput and latency: a higher batch-size typically leads to better hardware utilization and thus higher throughput, but at the same time batching requires performing more computation before the first results are available, and hence leads to higher latency. .. image:: /images/tradeoffs.png To understand why batching tends to improve throughput (up to a certain max value), it is useful to consider an intuitive visual performance model called 'the roofline model', which provides a theoretical bound on the system's performance: .. image:: /images/memoryvscompute.png The X-axis indicates the arithmetic intensity (AI) of the workload, which is the ratio between the number of operations and the number of bytes read-from/written-to memory. The Y-axis indicates the theoretical extractable performance. For small AI values the workload is expected to be memory bound, while for large AI values it is expected to be compute bound. For inference workloads, AI is often approximated by dividing the model's number of operations by its memory footprint (#params x dtype_size). To a first-order approximation, the AI value is linearly dependent on the batch-size, which means that the workload's performance (throughput) is expected to increase with the batch-size. To understand this more intuitively: for a larger batch size, Neuron can better amortize the cost of reading parameters from the external memory, and thus improve the overall hardware efficiency. It should be noted that while the roofline model can be very useful, it is not perfectly accurate (e.g. it doesn't take into account spills/fills from/to on-chip SRAM memories), and thus users are encouraged to use it as a tool for **estimating** the optimal batch-size for their workloads. How to determine the optimal batch-size for inference workloads? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The optimal batch size is dependent on the application-level requirements: some applications require strict latency guarantees (in which case, check out the :ref:`neuroncore-pipeline` technology), while other applications strictly aim to maximize throughput. We thus encourage our users to try out multiple batch-sizes, and compare performance between them. A good starting point for batch-size exploration can be identified using the roofline model: we can choose a batch-size that achieves an Arithmetic Intensity which is at the edge of the compute-bound region. By doing that, we aim to achieve max throughput with a minimal batch-size, and thus minimal impact to latency.
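One way to pick this starting point is to compute the roofline knee directly (a sketch only; the helper name is ours, and the inputs correspond to the terms of the equation below):

.. code-block:: python

   import math

   def roofline_starting_batch_size(peak_flops, mem_bw, model_flops,
                                    n_dense_params, dtype_size):
       machine_ai = peak_flops / mem_bw                            # FLOPs per byte at the knee
       model_ai_b1 = model_flops / (n_dense_params * dtype_size)   # model AI at batch 1
       return math.ceil(0.5 * machine_ai / model_ai_b1)

   # BERT-Large (SeqLen=128) on Inferentia: prints 6, matching the table below
   print(roofline_starting_batch_size(64e12, 50e9, 77.3e9, 302e6, 2))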
.. image:: /images/memoryvscompute2.png This rule can be expressed via the following equation: ``batch-size(Inference) = ceiling[0.5 x (<PeakFLOPS> / <MemBW>) / (<model-FLOPs> / (<#model-dense-params> x <dtype-size>))]`` (for NeuronDevice PeakFLOPS and MemBW, see the :ref:`trainium-arch`, :ref:`inferentia-arch` and :ref:`inferentia2-arch` pages.) For example, a BF16 BERT-Large model, with a sequence length of 128, will have the following approximated batch sizes: .. list-table:: :widths: auto :header-rows: 1 :stub-columns: 1 :align: left * - Model - NeuronDevice - Peak TFLOPS (BF16) - MemBW (GB/sec) - Model GFLOPs - Model Dense Params (Millions) - Data-type size (BF16) - Approximated optimal batch-size * - BERT-Large (SeqLen=128) - Inferentia - 64 - 50 - 77.3 - 302 - 2 - 6 * - BERT-Large (SeqLen=128) - Trainium - 210 - 820 - 77.3 - 302 - 2 - 2 * - ResNet-50 - Inferentia - 64 - 50 - 7.8 - 25 - 2 - 5 * - ResNet-50 - Trainium - 210 - 820 - 7.8 - 25 - 2 - 1 We recommend evaluating multiple batch sizes and comparing the performance between them, in order to determine the optimal latency/throughput deployment point. How to set the batch-size? ^^^^^^^^^^^^^^^^^^^^^^^^^^ The Neuron compiler takes a model and its sample input as inputs for the compilation process. For example, the code snippet below will compile a model with a batch-size of 4: .. code:: import torch import torch_neuron from torchvision import models # Load the model and set it to evaluation mode model = models.resnet50(pretrained=True) model.eval() # Compile with an example input of batch size 4 image = torch.rand([4, 3, 224, 224]) model_neuron = torch.neuron.trace(model, image, dynamic_batch_size=True) # Execute with a batch of 12 images batch = torch.rand([12, 3, 224, 224]) results = model_neuron(batch) For ahead-of-time compiled inference graphs (i.e. Inf1), dynamic batching can be used (as shown in the above code snippet) to process a larger client-side inference batch-size, and allow the framework to automatically break up the user-batch (12 in our case) into smaller batch sizes, to match the compiled batch-size (4 in our case). This technique increases the achievable throughput by hiding the framework-to-neuron overhead, and amortizing it over a larger batch size. .. seealso:: - :ref:`torch-neuronx-dynamic-batching` in ``torch-neuronx`` - :ref:`tensorflow-neuronx-special-flags` in ``tensorflow-neuronx``. Batching in training workloads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Unlike inference workloads, training is inherently an offline process, and thus doesn't have latency requirements. This means that training is almost always batched to some degree. Batch-size naming ^^^^^^^^^^^^^^^^^ For distributed processing, defining the batch size depends on the observation level. There are multiple terms you should be aware of when running a distributed training job, especially global batch size (GBS) and micro-batch. Knowing the batch size in advance is crucial for precompiling the computational graph and for setting the hyperparameters. micro-batch size The smallest number of samples processed in a single step on the accelerator. For very large models, it is frequently chosen to be 1. gradient accumulation The process of iterating over multiple micro-batches and summing up the gradients before an optimizer update. This can happen in a dedicated loop for gradient accumulation or as part of multiple iterations of samples in pipeline parallelism. See :ref:`pp_developer_guide` for more details on pipeline parallelism.
How to set the batch-size?
^^^^^^^^^^^^^^^^^^^^^^^^^^

The Neuron compiler takes a model and a sample input as inputs to the compilation process. For example, the code snippet below compiles a model with a batch-size of 4:

.. code:: python

   import torch
   import torch_neuron
   from torchvision import models

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Compile with an example input of batch size 4
   image = torch.rand([4, 3, 224, 224])
   model_neuron = torch.neuron.trace(model, image, dynamic_batch_size=True)

   # Execute with a batch of 12 images
   batch = torch.rand([12, 3, 224, 224])
   results = model_neuron(batch)

For ahead-of-time compiled inference graphs (i.e., on Inf1), dynamic batching can be used (as shown in the code snippet above) to process a larger client-side inference batch-size, and allow the framework to automatically break up the user batch (12 in our case) into smaller batches that match the compiled batch-size (4 in our case). This technique increases the achievable throughput by hiding the framework-to-Neuron overhead and amortizing it over a larger batch size.

.. seealso::

   - :ref:`torch-neuronx-dynamic-batching` in ``torch-neuronx``
   - :ref:`tensorflow-neuronx-special-flags` in ``tensorflow-neuronx``

Batching in training workloads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unlike inference workloads, training is inherently an offline process, and thus doesn't have latency requirements. This means that training is almost always batched to some degree.

Batch-size naming
^^^^^^^^^^^^^^^^^

For distributed processing, the definition of the batch size depends on the level of observation. There are multiple terms you should be aware of when running a distributed training job, especially global batch size (GBS) and micro-batch. Knowing the batch size in advance is crucial for precompiling the computational graph and for setting the hyperparameters. The relationships between these terms are summarized in the sketch after the definitions below.

micro-batch size
   The smallest number of samples processed in a single step on the accelerator. For very large models, it is frequently chosen to be 1.

gradient accumulation
   The process of iterating over multiple micro-batches and summing up the gradients before an optimizer update. This can happen in a dedicated gradient-accumulation loop or as part of multiple iterations of samples in pipeline parallelism. See :ref:`pp_developer_guide` for more details on pipeline parallelism.

data-parallel size (or DP degree)
   Number of model replicas that process different portions of data in parallel. Each replica maintains a complete copy of the model while processing unique data chunks, after which their gradients are synchronized for the optimizer update. See :ref:`neuron_hw_glossary` for more details.

global batch-size
   Number of total samples used for an update of the optimizer. This includes all the respective gradients that get added up from data-parallel processing or gradient accumulation. :literal:`global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps`

mini-batch or replica-batch size
   Number of samples that contribute to a gradient within one data-parallel rank. A mini-batch gradient is obtained by aggregating multiple micro-batch gradients, with or without a pipeline (i.e., gradient accumulation). :literal:`mini_batch_size = micro_batch_size * gradient_accumulation_steps`

worker batch
   The portion of mini-batch samples processed by a single worker. One worker (node) might host only a subset of the data-parallel ranks, and the worker batch captures how much data is processed by that worker.
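As a quick, self-contained illustration of how these definitions compose (the configuration values below are hypothetical):

.. code:: python

   # Hypothetical distributed-training configuration
   micro_batch_size = 1              # samples per step per accelerator
   gradient_accumulation_steps = 8   # micro-batch gradients summed per update
   data_parallel_size = 32           # number of model replicas

   # Samples contributing to the gradient within one data-parallel rank
   mini_batch_size = micro_batch_size * gradient_accumulation_steps   # -> 8

   # Samples consumed per optimizer update across all replicas
   global_batch_size = mini_batch_size * data_parallel_size           # -> 256

   print(mini_batch_size, global_batch_size)

Each data-parallel rank contributes ``mini_batch_size`` samples per optimizer update, while the optimizer itself consumes ``global_batch_size`` samples.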
How to determine the optimal batch-size for training workloads?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Determining the optimal batch-size for training workloads can be a non-trivial task. In most cases, we'd want to choose the largest batch-size that we can get away with. This is trickier than in inference, for two reasons:

1. *Higher memory footprint:* Training workloads have a higher memory footprint than inference, as they require saving more tensors aside from the model parameters, such as gradients, intermediate activations (passed between the forward-pass and backward-pass), and optimizer state. If the batch-size is increased beyond a certain point, one can run out of device memory (indicated by an 'Out of device memory' error, typically abbreviated as OOM).

2. *Arithmetic intensity estimation:* Arithmetic intensity is harder to estimate for training workloads than for inference workloads, as the majority of the external memory accesses are due to reads/writes of intermediate activation state (rather than parameters), which requires lower-level familiarity with the model to estimate correctly.

To estimate the memory footprint of a model, we look at the different contributors:

1. Weights and gradients: typically 2B each, thus 4B per parameter
2. Optimizer state: typically 4B - 12B per parameter
3. Intermediate activations: the sum of all tensor sizes in the forward pass; for example, for a transformer neural network this is roughly ``16 x <#layers> x <sequence-length> x <hidden-size> x <dtype-size> x <batch-size>``, i.e., about 100MB x <batch-size> for BERT-Large (SeqLen=128)

A good first-order approximation for the optimal batch-size in a training workload is the largest one that can fit in the device's memory (i.e., won't lead to an OOM error):

``batch-size(Training) = 0.6 x (<TP-rank> x <PP-rank> x <NeuronCore-device-memory>) / (<#model-dense-params> x <model-state-bytes-per-param>)``

.. note::

   TP-rank stands for Tensor-Parallelism rank, i.e., how many NeuronCores participate in a single Tensor-Parallelism group. Similarly, PP-rank stands for Pipeline-Parallelism rank, i.e., how many NeuronCores participate in a single Pipeline-Parallelism group.

For example, for BERT-Large Ph1 training, with a model state of 4B per parameter (2B weights, 2B gradients), 16GB of memory per NeuronCore, and TP-rank = PP-rank = 1, the approximated optimal per-NeuronCore training batch-size would be:

``batch-size(Training/Trainium) = 0.6 x (1 x 1 x 16e+9) / (300e+6 x 4) = 8``

================================================
FILE: about-neuron/arch/neuron-features/neuroncore-pipeline.rst
================================================

.. _neuroncore-pipeline:

NeuronCore Pipeline
===================

NeuronCore Pipeline is a Neuron software feature that shards a compute-graph across multiple NeuronCores, caches the model parameters in each core's on-chip memory (cache), and then streams inference requests across the cores in a pipelined manner. Based on the number of NeuronCores selected, the model might get seamlessly sharded across up to 16 Inferentia devices (i.e., 64 NeuronCores). This enables users to optimize for both throughput and latency, as it enables the NeuronCores to process neural networks with locally cached data and avoid the cost of accessing external memory.

|Image:|

One benefit of this approach is that NeuronCore Pipeline can typically reach maximal hardware efficiency without the need for batching (e.g., BERT, ResNet50). For maximal performance, users should choose an instance size that can cache the entire model by using sufficient NeuronCores. Inf1 instance types have different numbers of Inferentia devices, each of which has 4 NeuronCores, as shown here: https://aws.amazon.com/ec2/instance-types/inf1/

To enable the NeuronCore Pipeline optimization, the compiler should be invoked with the following flag: ``--neuroncore-pipeline-cores N``. The number of NeuronCores is typically chosen to be the minimal number that can fit the entire model, which is currently done through a trial-and-error process (compiling for different numbers of cores and looking for a compilation success/failure message). This process will be automated in the future.

A simple formula to help define the number of NeuronCores that may be an appropriate choice is:

::

   neuroncore-pipeline-cores = 4 * round( number-of-weights-in-model/(2 * 10^7) )

This allocates a set of NeuronCores based on the size of the given model's weights and normalizes to multiples of 4 so it uses full Inferentia devices.
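In code, the heuristic is straightforward (a hypothetical helper for illustration, not part of the Neuron tooling):

.. code:: python

   def neuroncore_pipeline_cores(num_weights: float) -> int:
       """Heuristic starting point for --neuroncore-pipeline-cores:
       round to multiples of 4 so that full Inferentia devices are used."""
       return 4 * round(num_weights / (2 * 10**7))

   # ResNet-50 has roughly 25.5M weights -> 4 NeuronCores (one Inferentia device)
   print(neuroncore_pipeline_cores(25.5e6))

Treat the result only as a starting point for the trial-and-error process described above.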
The code snippet below shows how to compile a model with NeuronCore Pipeline for 16 NeuronCores (instance size inf1.6xlarge):

::

   import numpy as np
   import tensorflow.neuron as tfn

   example_input = np.zeros([1,224,224,3], dtype='float16')
   tfn.saved_model.compile("rn50_fp16", "rn50_fp16_compiled/1",
                           model_feed_dict={'input_1:0': example_input},
                           compiler_args=['--neuroncore-pipeline-cores', '16'])

.. |Image:| image:: ./images/NeuronCorePipelining.png

================================================
FILE: about-neuron/arch/neuron-features/rounding-modes.rst
================================================

.. _neuron-rounding-modes:

Neuron Rounding Modes
=====================

.. contents:: Table of contents
   :local:
   :depth: 1

.. _neuron-rounding-mode-rne:

Round Nearest, ties to Even (RNE)
---------------------------------

When the exact result of a floating point operation cannot be exactly represented as a floating point value, it must be rounded. The IEEE 754-2008 standard defines the default rounding mode to be 'Round Nearest, ties to Even' (RNE for short). Under this scheme, numbers are rounded to the nearest representable value, and in case of a 'tie' (i.e., the number is exactly between the two nearest representable values) numbers are rounded to the nearest even number. All NeuronCore generations support the RNE rounding scheme, which is the most commonly used rounding scheme for Machine Learning workloads. Below is an illustration of the RNE rounding scheme:

.. image:: /images/rne1.png
   :width: 700

.. image:: /images/rne2.png
   :width: 700

.. image:: /images/rne3.png
   :width: 700

.. _neuron-rounding-mode-sr:

Stochastic Rounding (SR)
------------------------

One downside of the RNE rounding scheme (and the other rounding schemes described in the IEEE 754-2008 standard) is that when adding floating point values of significantly different magnitudes, rounding can squash small values and prevent them from accumulating over time. To improve this, starting from the second generation of the NeuronCore (NeuronCore-v2), customers can choose between the RNE rounding scheme described above and a second rounding scheme called 'Stochastic Rounding' (SR for short). Stochastic rounding prevents the computation precision-loss described above by performing the rounding operations in a probabilistic manner, according to the relative distance from the two nearest representable values, as illustrated below:

.. image:: /images/sr.png
   :width: 700

By performing the rounding in a probabilistic manner, this scheme allows small increments to accumulate over time, even when added to numbers of significantly higher magnitude, which leads to more precise results when performing large floating point computations (as done for machine learning).

Quick Tests
-----------

As an example, we examine the code snippet below:

::

   import torch
   import torch_xla
   import torch_xla.core.xla_model as xm

   device = xm.xla_device()

   a = torch.tensor(1024.0).half().to(device)
   for i in range(2048):
       a = a + 0.5
   xm.mark_step()
   print(a)

This code shows that rounding can significantly impact the calculation's precision over time. To use standard RNE rounding, set the environment variable ``NEURON_RT_STOCHASTIC_ROUNDING_EN=0``. To enable stochastic rounding, set the environment variable ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1``.

.. note::

   Stochastic rounding mode is enabled by default in PyTorch-Neuron when ``XLA_USE_BF16=1``.

The first test continues to show 1024: at this magnitude the spacing between representable FP16 values is 1.0, so every addition of 0.5 creates a tie that RNE rounds back down to the (even) value 1024. The second test shows a result close to the expected value of 2048.

::

   $ NEURON_RT_STOCHASTIC_ROUNDING_EN=0 python3 rounding_mode_test.py
   tensor(1024., device='xla:1', dtype=torch.float16)

   $ NEURON_RT_STOCHASTIC_ROUNDING_EN=1 python3 rounding_mode_test.py
   tensor(2056., device='xla:1', dtype=torch.float16)
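You can observe the RNE tie-breaking behavior on the host as well, without any Neuron hardware; a minimal sketch using NumPy's float16:

.. code:: python

   import numpy as np

   a = np.float16(1024.0)
   # 1024.5 sits exactly between 1024 and 1025; RNE picks the even
   # neighbor (1024), so the small increment is lost:
   print(a + np.float16(0.5))   # -> 1024.0
   # A full-ULP increment is exactly representable and does accumulate:
   print(a + np.float16(1.0))   # -> 1025.0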
================================================
FILE: about-neuron/arch/neuron-hardware/inf1-arch.rst
================================================

.. _aws-inf1-arch:

Amazon EC2 Inf1 Architecture
============================

On this page, we provide an architectural overview of the Amazon EC2 Inf1 instance and the corresponding :ref:`Inferentia <inferentia-arch>` NeuronChips that power them (Inferentia chips from here on).

.. contents:: Table of Contents
   :local:
   :depth: 2

.. _inf1-arch:

Inf1 Architecture
-----------------

The EC2 Inf1 instance is powered by up to 16 :ref:`Inferentia <inferentia-arch>` chips, allowing customers to choose between four instance sizes:

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * - Instance size
     - # of Inferentia chips
     - vCPUs
     - Host Memory (GiB)
     - FP16/BF16 TFLOPS
     - INT8 TOPS
     - Device Memory (GiB)
     - Device Memory bandwidth (GiB/sec)
     - NeuronLink-v1 chip-to-chip bandwidth (GiB/sec/chip)
     - EFA bandwidth (Gbps)
   * - Inf1.xlarge
     - 1
     - 4
     - 8
     - 64
     - 128
     - 8
     - 50
     - N/A
     - up to 25
   * - Inf1.2xlarge
     - 1
     - 8
     - 16
     - 64
     - 128
     - 8
     - 50
     - N/A
     - up to 25
   * - Inf1.6xlarge
     - 4
     - 24
     - 48
     - 256
     - 512
     - 32
     - 200
     - 32
     - 25
   * - Inf1.24xlarge
     - 16
     - 96
     - 192
     - 1024
     - 2048
     - 128
     - 800
     - 32
     - 100

Inf1 offers a direct chip-to-chip interconnect called NeuronLink-v1, which enables co-optimizing latency and throughput via the :ref:`NeuronCore Pipeline <neuroncore-pipeline>` technology.

.. image:: /images/inf1-server-arch.png

================================================
FILE: about-neuron/arch/neuron-hardware/inf2-arch.rst
================================================

.. _aws-inf2-arch:

Amazon EC2 Inf2 Architecture
============================

On this page, we provide an architectural overview of the Amazon EC2 Inf2 instances and the corresponding Inferentia2 NeuronChips that power them (Inferentia2 chips from here on).

Inf2 Architecture
-----------------

The EC2 Inf2 instance is powered by up to 12 :ref:`Inferentia2 chips <inferentia2-arch>`, and allows customers to choose between four instance sizes:

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * - Instance size
     - # of Inferentia2 chips
     - vCPUs
     - Host Memory (GiB)
     - FP8/FP16/BF16/TF32 TFLOPS
     - FP32 TFLOPS
     - Device Memory (GiB)
     - Instance Memory Bandwidth (GiB/sec)
     - NeuronLink-v2 chip-to-chip (GiB/sec/chip)
   * - Inf2.xlarge
     - 1
     - 4
     - 16
     - 190
     - 47.5
     - 32
     - 820
     - N/A
   * - Inf2.8xlarge
     - 1
     - 32
     - 128
     - 190
     - 47.5
     - 32
     - 820
     - N/A
   * - Inf2.24xlarge
     - 6
     - 96
     - 384
     - 1140
     - 285
     - 192
     - 4920
     - 192
   * - Inf2.48xlarge
     - 12
     - 192
     - 768
     - 2280
     - 570
     - 384
     - 9840
     - 192

Inf2 offers a low-latency, high-bandwidth chip-to-chip interconnect called NeuronLink-v2, which enables high-performance collective communication operations (e.g., AllReduce and AllGather). This allows sharding large models across Inferentia2 chips (e.g., via Tensor Parallelism), thus optimizing latency and throughput. This capability is especially useful when deploying Large Generative Models.

.. image:: /images/inf2-topology.png

================================================
FILE: about-neuron/arch/neuron-hardware/inferentia.rst
================================================

.. _inferentia-arch:

Inferentia Architecture
-----------------------

At the heart of each Inf1 instance are up to sixteen Inferentia chips, each with four :ref:`NeuronCore-v1 <neuroncores-v1-arch>` cores, as depicted below:
.. image:: /images/inferentia-neurondevice.png

Each Inferentia chip consists of:

.. list-table::
   :widths: auto
   :header-rows: 0
   :stub-columns: 1
   :align: left

   * - Compute
     - Four :ref:`NeuronCore-v1 <neuroncores-v1-arch>` cores, delivering 128 INT8 TOPS and 64 FP16/BF16 TFLOPS
   * - Device Memory
     - 8 GiB of device DRAM memory (for storing parameters and intermediate state), with 50 GiB/sec of bandwidth
   * - NeuronLink
     - Enables co-optimization of latency and throughput via the :ref:`NeuronCore Pipeline <neuroncore-pipeline>` technology

================================================
FILE: about-neuron/arch/neuron-hardware/inferentia2.rst
================================================

.. _inferentia2-arch:

Inferentia2 Architecture
------------------------

At the heart of each Inf2 instance are up to twelve Inferentia2 chips (each with two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Inferentia2 is the second-generation AWS purpose-built Machine Learning inference accelerator. The Inferentia2 chip architecture is depicted below:

.. image:: /images/inferentia2.png

Each Inferentia2 chip consists of:

.. list-table::
   :widths: auto
   :header-rows: 0
   :stub-columns: 1
   :align: left

   * - Compute
     - Two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores, delivering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS.
   * - Device Memory
     - 32 GiB of high-bandwidth device memory (HBM) (for storing model state), with 820 GiB/sec of bandwidth.
   * - Data Movement
     - 1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
   * - NeuronLink
     - NeuronLink-v2 for chip-to-chip interconnect enables high-performance collective compute for co-optimization of latency and throughput.
   * - Programmability
     - Inferentia2 supports dynamic shapes and control flow, via ISA extensions of NeuronCore-v2, and custom operators via the deeply embedded GPSIMD engines.

For a more detailed description of all the hardware engines, see :ref:`NeuronCore-v2 <neuroncores-v2-arch>`.

================================================
FILE: about-neuron/arch/neuron-hardware/neuron-core-v1.rst
================================================

.. _neuroncores-v1-arch:

NeuronCore-v1 Architecture
--------------------------

NeuronCore-v1 is the first-generation NeuronCore engine, powering the Inferentia chips. Each NeuronCore-v1 is a fully-independent heterogeneous compute unit, with three main engines (Tensor/Vector/Scalar Engines) and on-chip software-managed SRAM memory (compiler-managed, for maximum data locality and optimized data prefetch).

.. image:: /images/nc-v1.png

The ScalarEngine is optimized for scalar computations, in which every element of the output is dependent on one element of the input, e.g., non-linearities such as GELU, SIGMOID, or EXP. The ScalarEngine is highly parallelized, and can process 512 floating point operations per cycle.
It can handle various data types, including FP16, BF16, FP32, INT8, INT16, and INT32.

The VectorEngine is optimized for vector computations, in which every element of the output is dependent on multiple input elements. Examples include 'axpy' operations (Z=aX+Y), Layer Normalization, Pooling operations, and many more. The VectorEngine is also highly parallelized, and can perform 256 floating point operations per cycle. It can handle various data types, including FP16, BF16, FP32, INT8, INT16, and INT32.

The TensorEngine is based on a power-optimized systolic array, which is highly optimized for tensor computations (e.g., GEMM, CONV, Reshape, Transpose), and supports mixed-precision computations (FP16/BF16/INT8 inputs, FP32/INT32 outputs). Each NeuronCore-v1 TensorEngine delivers 16 TFLOPS of FP16/BF16 tensor computations.

================================================
FILE: about-neuron/arch/neuron-hardware/neuron-core-v2.rst
================================================

.. _neuroncores-v2-arch:

NeuronCore-v2 Architecture
--------------------------

NeuronCore-v2 is the second generation of the NeuronCore engine, powering the Trainium chips. Each NeuronCore-v2 is a fully-independent heterogeneous compute unit, with 4 main engines (Tensor/Vector/Scalar/GPSIMD Engines) and on-chip software-managed SRAM memory (compiler-managed, for maximum data locality and optimized data prefetch).

.. image:: /images/nc-v2.png

Just like in NeuronCore-v1, the ScalarEngine is optimized for scalar computations, in which every element of the output is dependent on one element of the input. The ScalarEngine is highly parallelized, and delivers 2.9 TFLOPS of FP32 computations (3x speedup relative to NeuronCore-v1). The NeuronCore-v2 ScalarEngine can handle various data types, including cFP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

The VectorEngine is optimized for vector computations, in which every element of the output is dependent on multiple input elements. Examples include 'axpy' operations (Z=aX+Y), Layer Normalization, Pooling operations, and many more. The VectorEngine is also highly parallelized, and delivers 2.3 TFLOPS of FP32 computations (10x speedup vs. NeuronCore-v1). The NeuronCore-v2 VectorEngine can handle various data types, including cFP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

The TensorEngine is based on a power-optimized systolic array, which is highly optimized for tensor computations (e.g., GEMM, CONV, Transpose), and supports mixed-precision computations (cFP8 / FP16 / BF16 / TF32 / FP32 / INT8 inputs, FP32 / INT32 outputs). Each NeuronCore-v2 TensorEngine delivers over 90 TFLOPS of FP16/BF16 tensor computations (6x speedup from NeuronCore-v1).

NeuronCore-v2 also introduces a new engine called the GPSIMD-Engine, which consists of eight fully-programmable 512-bit wide vector processors that can execute general-purpose C code and access the embedded on-chip SRAM memory. With these cores, customers can implement custom operators and execute them directly on the NeuronCores.

NeuronCore-v2 also adds support for control flow, dynamic shapes, and programmable :ref:`rounding mode <neuron-rounding-modes>` (RNE & Stochastic Rounding).

================================================
FILE: about-neuron/arch/neuron-hardware/neuron-core-v3.rst
================================================

.. _neuroncores-v3-arch:

NeuronCore-v3 Architecture
--------------------------

NeuronCore-v3 is the third-generation NeuronCore that powers Trainium2 chips.
It is a fully-independent heterogeneous compute unit consisting of 4 main engines: Tensor, Vector, Scalar, and GPSIMD, with on-chip software-managed SRAM memory to maximize data locality and optimize data prefetch. The following diagram shows a high-level overview of the NeuronCore-v3 architecture.

.. image:: /images/architecture/NeuronCore/nc-v3.png
   :align: center
   :width: 250

|

NeuronCore-v3 is made up of the following components:

On-chip SRAM
""""""""""""

Each NeuronCore-v3 has a total of 28MB of on-chip SRAM. NeuronCore-v3 on-chip SRAM is software-managed to maximize data locality and optimize data prefetch.

Tensor Engine
"""""""""""""

Tensor Engines are based on a power-optimized systolic array. They are highly optimized for tensor computations such as GEMM, CONV, and Transpose. Tensor Engines support mixed-precision computations, including cFP8, FP16, BF16, TF32, and FP32 inputs and outputs. A NeuronCore-v3 Tensor Engine delivers 158 cFP8 TFLOPS, and 79 BF16/FP16/TF32 TFLOPS of tensor computations. Like NeuronCore-v2, NeuronCore-v3 supports control flow, dynamic shapes, and programmable rounding mode (RNE & Stochastic Rounding). NeuronCore-v3 also supports adjustable exponent biasing for the cFP8 data type.

The NeuronCore-v3 Tensor Engine also supports Structured Sparsity, delivering up to 316 TFLOPS of cFP8/FP16/BF16/TF32 compute. This is useful when one of the input tensors to a matrix multiplication exhibits an M:N sparsity pattern, where only M elements out of every N contiguous elements are non-zero. NeuronCore-v3 supports several sparsity patterns, including 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, and 1:2.
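To make the M:N constraint concrete, here is a small illustrative check (plain Python, not a Neuron API) that a row of values satisfies a given pattern:

.. code:: python

   def satisfies_m_of_n(values, m: int, n: int) -> bool:
       """Check that at most m elements are non-zero in every group of
       n contiguous elements (the M:N structured sparsity constraint)."""
       return all(
           sum(1 for v in values[i:i + n] if v != 0) <= m
           for i in range(0, len(values), n)
       )

   row = [0.5, 0.0, 0.0, -1.2, 0.0, 0.0, 0.7, 0.0]   # follows a 2:4 pattern
   print(satisfies_m_of_n(row, m=2, n=4))            # -> True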
Vector Engine
"""""""""""""

Optimized for vector computations, in which every element of the output is dependent on multiple input elements. Examples include 'axpy' operations (Z=aX+Y), Layer Normalization, and Pooling operations. Vector Engines are highly parallelized, and deliver a total of 1 TFLOPS of FP32 computations. NeuronCore-v3 Vector Engines can handle various data types, including cFP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

Scalar Engine
"""""""""""""

Optimized for scalar computations, in which every element of the output is dependent on one element of the input. Scalar Engines are highly parallelized, and deliver a total of 1.2 TFLOPS of FP32 computations. NeuronCore-v3 Scalar Engines support multiple data types, including cFP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

GPSIMD Engine
"""""""""""""

Each GPSIMD engine consists of eight fully-programmable 512-bit wide vector processors. They can execute general-purpose C code and access the embedded on-chip SRAM, allowing you to implement custom operators and execute them directly on the NeuronCores.

================================================
FILE: about-neuron/arch/neuron-hardware/neuron-core-v4.rst
================================================

.. meta::
   :description: "NeuronCore-v4 architecture overview and components."
   :date-modified: 12/02/2025

.. _neuroncores-v4-arch:

NeuronCore-v4 Architecture
==========================

NeuronCore-v4 is the fourth-generation NeuronCore that powers Trainium3 chips. It is a fully-independent heterogeneous compute unit consisting of 4 main engines: Tensor, Vector, Scalar, and GPSIMD, with on-chip software-managed SRAM memory to maximize data locality and optimize data prefetch. The following diagram shows a high-level overview of the NeuronCore-v4 architecture.

.. image:: /images/architecture/trn3/neuroncore-v4.png
   :align: center

Like previous generations of NeuronCore, NeuronCore-v4 supports control flow, dynamic shapes, and programmable rounding mode (RNE & Stochastic Rounding).

NeuronCore-v4 is made up of the following components:

On-chip SRAM
------------

Each NeuronCore-v4 has a total of 32MiB of on-chip SRAM. The on-chip SRAM is software-managed to maximize data locality and optimize data prefetch. NeuronCore-v4 SRAM also introduces a new near-memory accumulation feature, which allows DMA engines to perform a read-add-write operation on existing SRAM data via a single transfer.

Tensor Engine
-------------

Tensor Engines are based on a power-optimized systolic array. They are highly optimized for tensor computations such as GEMM, CONV, and Transpose. Tensor Engines support mixed-precision computations, including MXFP8/MXFP4, FP16, BF16, TF32, and FP32 inputs. The output data type can be either FP32 or BF16. A NeuronCore-v4 Tensor Engine delivers 315 MXFP8/MXFP4 TFLOPS, where MXFP8/MXFP4 are OCP (Open Compute Project) compliant data type formats. MXFP4 data types are converted to MXFP8 before the Tensor Engine computation logic, using an arbitrary programmer-defined mapping. Besides quantized data types, a NeuronCore-v4 Tensor Engine also delivers 79 BF16/FP16/TF32 and 20 FP32 TFLOPS of tensor computations.

The NeuronCore-v4 Tensor Engine also supports Structured Sparsity, delivering up to 315 TFLOPS of FP16/BF16/TF32 compute. This is useful when one of the input tensors to a matrix multiplication exhibits an M:N sparsity pattern, where only M elements out of every N contiguous elements are non-zero. NeuronCore-v4 supports several sparsity patterns, including 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, and 1:2.

Vector Engine
-------------

Optimized for vector computations, in which every element of the output is dependent on multiple input elements. Examples include 'axpy' operations (Z=aX+Y), Layer Normalization, and Pooling operations. Vector Engines are highly parallelized, and deliver a total of 1.2 TFLOPS of FP32 computations. NeuronCore-v4 Vector Engines can handle various data types, including FP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

In addition, the NeuronCore-v4 Vector Engine supports two new features:

1. Data quantization into MXFP8 data type formats from BF16/FP16, which is particularly useful for online data quantization between MLP (multi-layer perceptron) layers.
2. Fast exponential function evaluation, at 4x higher throughput than the exponential on the Scalar Engine, which is particularly useful in self-attention acceleration.

Scalar Engine
-------------

Optimized for scalar computations, in which every element of the output is dependent on one element of the input. Scalar Engines are highly parallelized, and deliver a total of 1.2 TFLOPS of FP32 computations. NeuronCore-v4 Scalar Engines support multiple data types, including FP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.

GPSIMD Engine
-------------

Each GPSIMD engine consists of eight fully-programmable 512-bit wide vector processors. They can execute general-purpose C/C++ code and access the embedded on-chip SRAM, allowing you to implement custom operators and execute them directly on the NeuronCores.

================================================
FILE: about-neuron/arch/neuron-hardware/trainium.rst
================================================
.. _trainium-arch:

Trainium Architecture
---------------------

At the heart of the Trn1 instance are 16 x Trainium chips (each Trainium chip includes 2 x :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Trainium is the second-generation purpose-built Machine Learning accelerator from AWS. The Trainium chip architecture is depicted below:

.. image:: /images/trainium-neurondevice.png

Each Trainium chip consists of:

.. list-table::
   :widths: auto
   :header-rows: 0
   :stub-columns: 1
   :align: left

   * - Compute
     - Two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores, delivering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS.
   * - Device Memory
     - 32 GiB of device memory (for storing model state), with 820 GiB/sec of bandwidth.
   * - Data Movement
     - 1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
   * - NeuronLink
     - NeuronLink-v2 for chip-to-chip interconnect enables efficient scale-out training, as well as memory pooling between the different Trainium chips.
   * - Programmability
     - Trainium supports dynamic shapes and control flow, via ISA extensions of NeuronCore-v2. In addition, Trainium also allows for user-programmable :ref:`rounding mode <neuron-rounding-modes>` (Round Nearest Even or Stochastic Rounding), and custom operators via the deeply embedded GPSIMD engines.

For a detailed description of all the hardware engines, see :ref:`NeuronCore-v2 <neuroncores-v2-arch>`.

================================================
FILE: about-neuron/arch/neuron-hardware/trainium2.rst
================================================

.. _trainium2-arch:

######################
Trainium2 Architecture
######################

Trainium2 is the third-generation purpose-built Machine Learning chip from AWS. Every Trainium2 chip contains eight NeuronCore-v3 cores. Beginning with Trainium2, AWS Neuron adds support for Logical NeuronCore Configuration (LNC), which lets you combine the compute and memory resources of multiple physical NeuronCores into a single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 chip.

.. image:: /images/architecture/Trainium2/trainium2.png
   :align: center
   :width: 400

=========================
Trainium2 chip components
=========================

Each Trainium2 chip consists of the following components:
.. list-table::
   :widths: auto
   :header-rows: 0
   :stub-columns: 1
   :align: left

   * - Compute
     - Eight NeuronCore-v3 cores that collectively deliver:

       * 1,299 FP8 TFLOPS
       * 667 BF16/FP16/TF32 TFLOPS
       * 2,563 FP8/FP16/BF16/TF32 sparse TFLOPS
       * 181 FP32 TFLOPS
   * - Device Memory
     - 96 GiB of device memory with 2.9 TB/sec of bandwidth.
   * - Data Movement
     - 3.5 TB/sec of DMA bandwidth, with inline memory compression and decompression.
   * - NeuronLink
     - NeuronLink-v3 for chip-to-chip interconnect provides 1.28 TB/sec bandwidth per chip. It allows for efficient scale-out training and inference, as well as memory pooling between Trainium2 chips.
   * - Programmability
     - Trainium2 supports dynamic shapes and control flow via NeuronCore-v3 ISA extensions. Trainium2 also allows for user-programmable :ref:`rounding mode <neuron-rounding-modes>` (Round Nearest Even or Stochastic Rounding), and custom operators via deeply embedded GPSIMD engines.
   * - Collective communication
     - 16 CC-Cores orchestrate collective communication among Trainium2 chips within and across instances.

==================================
Trainium2 performance improvements
==================================

The following set of tables offers a comparison between Trainium and Trainium2 chips.

Compute
"""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium
     - Trainium2
     - Improvement factor
   * - FP8 (TFLOPS)
     - 191
     - 1299
     - 6.7x
   * - BF16/FP16/TF32 (TFLOPS)
     - 191
     - 667
     - 3.4x
   * - FP32 (TFLOPS)
     - 48
     - 181
     - 3.7x
   * - FP8/FP16/BF16/TF32 Sparse (TFLOPS)
     - Not applicable
     - 2563
     - Not applicable

Memory
""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium
     - Trainium2
     - Improvement factor
   * - HBM Capacity (GiB)
     - 32
     - 96
     - 3x
   * - HBM Bandwidth (TB/sec)
     - 0.8
     - 2.9
     - 3.6x
   * - SBUF Capacity (MiB)
     - 48
     - 224
     - 4.7x
   * - Memory Pool Size
     - Up to 16 chips
     - Up to 64 chips
     - 4x

Interconnect
""""""""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium
     - Trainium2
     - Improvement factor
   * - Inter-chip Interconnect (GB/sec/chip)
     - 384
     - 1280
     - 3.3x

Data movement
"""""""""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium
     - Trainium2
     - Improvement factor
   * - CC Cores
     - 6
     - 16
     - 3.3x
   * - DMA barriers
     - Write-after-write
     - Strong-order-write
     - \>1x (Benefit DMA-size dependent)
   * - SBUF memory layout
     - Row-major
     - Row-major, Col-major-2B, Col-major-4B
     - Not applicable

====================
Additional resources
====================

For a detailed description of NeuronCore-v3 hardware engines, instances powered by AWS Trainium2, and Logical NeuronCore configuration, see the following resources:

* :ref:`NeuronCore-v3 architecture <neuroncores-v3-arch>`
* :ref:`Amazon EC2 Trn2 architecture <aws-trn2-arch>`
* Logical NeuronCore configuration

================================================
FILE: about-neuron/arch/neuron-hardware/trainium3.rst
================================================

.. meta::
   :description: "Neuron Trainium3 (Trn3) architecture overview."
   :date-modified: 12/02/2025

.. _trainium3-arch:

Trainium3 Architecture
======================

Trainium3 is the fourth-generation purpose-built Machine Learning chip from AWS. A Trainium3 device contains eight NeuronCore-v4 cores. Similar to Trainium2, AWS Neuron supports Logical NeuronCore Configuration (LNC), which lets you combine the compute and memory resources of multiple physical NeuronCores into a single logical NeuronCore.
The following diagram shows the architecture overview of a Trainium3 chip.

.. image:: /images/architecture/trn3/neuroncore-v4-overview.png
   :align: center

NeuronCore-v4
-------------

Each Trainium3 chip consists of the following components:

.. list-table::
   :widths: auto
   :header-rows: 0
   :stub-columns: 1
   :align: left

   * - Compute
     - Eight NeuronCore-v4 cores that collectively deliver:

       * 2,517 MXFP8/MXFP4 TFLOPS
       * 671 BF16/FP16/TF32 TFLOPS
       * 2,517 FP16/BF16/TF32 sparse TFLOPS
       * 183 FP32 TFLOPS
   * - Device memory
     - 144 GiB of device memory, with 4.9 TB/sec of bandwidth.
   * - Data movement
     - 4.9 TB/sec of DMA bandwidth, with inline computation.
   * - NeuronLink
     - NeuronLink-v4 for device-to-device interconnect provides 2.56 TB/sec bandwidth per device. It enables efficient scale-out training, as well as memory pooling between the different Trainium3 devices.
   * - Programmability
     - Trainium3 supports dynamic shapes and control flow, via ISA extensions of NeuronCore-v4. Trainium3 also allows for user-programmable rounding mode (Round Nearest Even or Stochastic Rounding), and custom operators via the deeply embedded GPSIMD engines.
   * - Collective communication
     - 16 CC-Cores orchestrate collective communication among Trainium3 devices, both within a server and across servers.

Trainium3 performance improvements
----------------------------------

The following set of tables offers a comparison between Trainium2 and Trainium3 chips.

Compute
"""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium2
     - Trainium3
     - Improvement factor
   * - MXFP4 (TFLOPS)
     - Not applicable
     - 2517
     - Not applicable
   * - FP8 (TFLOPS)
     - 1299
     - 2517
     - 2x
   * - BF16/FP16/TF32 (TFLOPS)
     - 667
     - 671
     - 1x
   * - FP32 (TFLOPS)
     - 181
     - 183
     - 1x

Memory
""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium2
     - Trainium3
     - Improvement factor
   * - HBM Capacity (GiB)
     - 96
     - 144
     - 1.5x
   * - HBM Bandwidth (TB/sec)
     - 2.9
     - 4.9
     - 1.7x
   * - SBUF Capacity (MiB)
     - 224
     - 256
     - 1.14x

Interconnect
""""""""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium2
     - Trainium3
     - Improvement factor
   * - Inter-chip Interconnect (GB/sec/chip)
     - 1280
     - 2560
     - 2x

Data movement
"""""""""""""

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * -
     - Trainium2
     - Trainium3
     - Improvement factor
   * - DMA Bandwidth (TB/sec)
     - 3.5
     - 4.9
     - 1.4x

Additional resources
--------------------

For a detailed description of NeuronCore-v4 hardware engines, instances powered by AWS Trainium3, and Logical NeuronCore configuration, see the following resources:

* :ref:`NeuronCore-v4 architecture <neuroncores-v4-arch>`

================================================
FILE: about-neuron/arch/neuron-hardware/trn1-arch.rst
================================================

.. _aws-trn1-arch:

Amazon EC2 Trn1/Trn1n Architecture
==================================

On this page, we provide an architectural overview of the AWS Trn1/Trn1n instances, and the corresponding :ref:`Trainium <trainium-arch>` NeuronChips that power them (Trainium chips from here on).

.. contents:: Table of contents
   :local:
   :depth: 2

.. _trn1-arch:

Trn1/Trn1n Architecture
-----------------------

An EC2 Trn1/Trn1n instance is powered by up to 16 :ref:`Trainium <trainium-arch>` chips.
.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * - Instance size
     - # of Trainium chips
     - vCPUs
     - Host Memory (GiB)
     - FP8/FP16/BF16/TF32 TFLOPS
     - FP32 TFLOPS
     - Device Memory (GiB)
     - Device Memory Bandwidth (GiB/sec)
     - NeuronLink-v2 chip-to-chip bandwidth (GiB/sec/chip)
     - EFA bandwidth (Gbps)
   * - Trn1.2xlarge
     - 1
     - 8
     - 32
     - 190
     - 47.5
     - 32
     - 820
     - N/A
     - up to 25
   * - Trn1.32xlarge
     - 16
     - 128
     - 512
     - 3,040
     - 760
     - 512
     - 13,120
     - 384
     - 800
   * - Trn1n.32xlarge
     - 16
     - 128
     - 512
     - 3,040
     - 760
     - 512
     - 13,120
     - 768
     - 1,600

The Trn1.2xlarge instance size allows customers to train their models on a single Trainium chip, which is useful for small model training, as well as for model experimentation. The Trn1.32xlarge and Trn1n.32xlarge instance sizes come with a high-bandwidth and low-latency NeuronLink-v2 chip-to-chip interconnect, which utilizes a 2D Torus topology. This is useful for collective communication between the Trainium chips during scale-out training, as well as for pooling the memory capacity of all Trainium chips, making it directly addressable from each of the chips.

In a Trn1/Trn1n server, the Trainium chips are connected in a 2D Torus topology, as depicted below:

.. image:: /images/trn1-topology.png

The Trn1/Trn1n instances are also available in an EC2 UltraCluster, which enables customers to scale Trn1/Trn1n instances to over 100,000 Trainium chips, and leverage the AWS-designed, non-blocking, petabit-scale EFA networking infrastructure.

.. image:: /images/ultracluster-1.png

================================================
FILE: about-neuron/arch/neuron-hardware/trn2-arch.rst
================================================

.. _aws-trn2-arch:

############################
Amazon EC2 Trn2 Architecture
############################

Trn2 is an Amazon EC2 accelerated computing instance, purpose-built for high-performance deep learning training and inference. This page provides an architecture overview of the trn2.48xlarge and trn2u.48xlarge instances, and the Trn2 UltraServer.

.. contents:: Topics
   :local:
   :depth: 2

.. _trn2-arch:

Trn2 instance sizes
===================

Trn2 instances and UltraServers are available in the following sizes and configurations:

* trn2.48xlarge
* trn2u.48xlarge
* Trn2 UltraServer

.. _trn2-instance:

trn2.48xlarge / trn2u.48xlarge
""""""""""""""""""""""""""""""

Trn2 instances are powered by 16 Trainium2 chips connected using a high-bandwidth, low-latency NeuronLink-v3 chip-to-chip interconnect. The NeuronLink-v3 chip-to-chip interconnect enables collective communication between Trainium2 chips during distributed training and inference. It also allows for the pooling of memory resources from all 16 Trainium2 chips. In a trn2.48xlarge or trn2u.48xlarge instance, the 16 Trainium2 chips are connected using a 4x4, 2D Torus topology. The following diagram shows the intra-instance connections of a trn2.48xlarge or trn2u.48xlarge instance.

.. image:: /images/architecture/Trn2/trn2.48xlarge.png
   :align: center
   :width: 650

|

.. _trn2-ultraserver:

Trn2 UltraServer
""""""""""""""""

A Trn2 UltraServer comprises four trn2u.48xlarge instances connected together via the NeuronLink-v3 chip-to-chip interconnect. This allows for a total of 64 Trainium2 chips to be interconnected within a Trn2 UltraServer. Trainium2 chips with the same coordinates in each Trn2 instance are connected in a ring topology. The following figure shows the inter-instance ring connection between Trainium2 chips.
.. image:: /images/architecture/Trn2/u-trn2x64.png
   :align: center
   :width: 650

|

Trn2 instance specifications
============================

The following table shows the performance metrics for Trainium2-based instances.

.. list-table::
   :widths: auto
   :header-rows: 1
   :stub-columns: 1
   :align: left

   * - Performance specification
     - trn2.48xlarge / trn2u.48xlarge
     - Trn2 UltraServer
   * - # of Trainium2 chips
     - 16
     - 64
   * - vCPUs
     - 192
     - 768
   * - Host Memory (GiB)
     - 2,048
     - 8,192
   * - FP8 PFLOPS
     - 20.8
     - 83.2
   * - FP16/BF16/TF32 PFLOPS
     - 10.7
     - 42.8
   * - FP8/FP16/BF16/TF32 Sparse PFLOPS
     - 41
     - 164
   * - FP32 PFLOPS
     - 2.9
     - 11.6
   * - Device Memory (GiB)
     - 1,536
     - 6,144
   * - Device Memory Bandwidth (TB/sec)
     - 46.4
     - 185.6
   * - Intra-instance NeuronLink-v3 bandwidth (GB/sec/chip)
     - 1,024
     - 1,024
   * - Inter-instance NeuronLink-v3 bandwidth (GB/sec/chip)
     - Not applicable
     - 256
   * - EFAv3 bandwidth (Gbps)
     - 3,200
     - 3,200

================================================
FILE: about-neuron/arch/neuron-hardware/trn3-arch.rst
================================================

.. _aws-trn3-arch:

############################
Amazon EC2 Trn3 Architecture
############################

Amazon EC2 **Trn3** instances are accelerated computing instances powered by Trainium3 AI chips, purpose-built for high-performance deep learning training and inference. Trn3 is available in two UltraServer scale-up configurations: Gen1 with 64 Trainium3 chips per UltraServer, and Gen2 with 144 chips per UltraServer. Both configurations use NeuronSwitch-v1 interconnect technology to enable all-to-all connectivity between chips, which is especially well suited to workloads that leverage all-to-all communication patterns, such as Mixture of Experts models and autoregressive inference serving.

=====================
Trn3 Gen1 UltraServer
=====================

The EC2 Trn3 Gen1 UltraServers deliver 161 PetaFLOPS of dense MXFP8 compute, 314 TB/s of HBM bandwidth, and 9 TB of HBM capacity. Each UltraServer consists of four servers with 16 Trainium3 devices per server. Therefore, the UltraServer integrates a total of 64 Trainium3 devices into a single scale-up domain, interconnected via our latest-generation NeuronLink-v4 and the newly introduced NeuronSwitch-v1. The chip-to-chip topology features an all-to-all connectivity design, replacing the previous 2D-torus architecture. This all-to-all topology is optimized for workloads that require efficient all-to-all communication patterns or ultra-low-latency collectives, including Mixture of Experts models and autoregressive inference serving. The following diagram illustrates the Trn3 Gen1 UltraServer connectivity.

.. image:: /images/architecture/trn3/trn3-ultraserver-gen1.png
   :align: center

=====================
Trn3 Gen2 UltraServer
=====================

The EC2 Trn3 Gen2 UltraServers deliver 362 PetaFLOPS of dense MXFP8 compute, 706 TB/s of HBM bandwidth, and 20 TB of HBM capacity. Each UltraServer consists of 36 servers with 4 Trainium3 devices per server. Trainium3 devices within the same server are connected via a first-level NeuronSwitch-v1, while devices across servers are connected via two second-level NeuronSwitch-v1 and NeuronLink-v4. Therefore, the UltraServer integrates 144 Trainium3 devices into a single scale-up domain. Like Gen1, the chip-to-chip topology features an all-to-all connectivity design optimized for Mixture of Experts models and autoregressive inference serving. The following diagram illustrates the Trn3 Gen2 UltraServer connectivity.
.. image:: /images/architecture/trn3/trn3-ultraserver-gen2.png
   :align: center

==========================================
Trn3 Gen1/Gen2 UltraServer specifications
==========================================

The following table shows the performance metrics for Trainium3-based instances.

.. list-table::
   :header-rows: 2
   :stub-columns: 1
   :widths: 30 20 20

   * -
     - Trn3 Gen1 UltraServer
     - Trn3 Gen2 UltraServer
   * - Configuration
     -
     -
   * - # of Trainium3 devices
     - 64
     - 144
   * - Host vCPUs
     - 768
     - 2304
   * - Host Memory (GiB)
     - 8,192
     - 27,648
   * - **Compute**
     -
     -
   * - MXFP8/MXFP4 TFLOPS
     - 161,088
     - 362,448
   * - FP16/BF16/TF32 TFLOPS
     - 42,944
     - 96,624
   * - FP32 TFLOPS
     - 11,712
     - 26,352
   * - **Memory**
     -
     -
   * - Device Memory (GiB)
     - 9,216
     - 20,736
   * - Device Memory Bandwidth (TB/sec)
     - 313.6
     - 705.6
   * - **Interconnect**
     -
     -
   * - NeuronLink-v4 bandwidth (GiB/sec/device)
     - 2,048
     - 2,048
   * - EFA bandwidth (Gbps)
     - 12,800
     - 28,800

============================================
Trn3 UltraServer Connectivity and Networking
============================================

Trn3 UltraServers use a PCIe switch-based interconnect architecture for all chip-to-chip communication, both within and across servers. This replaces the point-to-point NeuronLink topology used in previous generations (Trn1, Trn2) with a switched fabric that enables flexible, all-to-all connectivity across the entire UltraServer domain.

Intra-server connectivity
-------------------------

Each server (sled) contains 4 Trainium3 chips connected through an intra-server PCIe switch. Each chip provides four PCIe Gen6 x8 links to this switch, delivering a total of 256 GB/s of bidirectional bandwidth between chips within the same server. This local switch enables low-latency communication for operations like tensor parallelism and data-parallel gradient synchronization within a server.

Inter-server connectivity
-------------------------

All servers within a rack are connected through inter-server PCIe switches. Each Trainium3 chip provides five PCIe Gen6 x8 links to the inter-server switch, delivering 320 GB/s of bidirectional bandwidth per chip for cross-server communication. This enables collective operations such as all-reduce and all-gather to span all servers in a rack without requiring host CPU involvement.

Inter-rack connectivity
-----------------------

For multi-rack configurations, Trainium3 chips in corresponding positions across racks are connected via dedicated direct PCIe links. Each chip provides two PCIe Gen6 x8 links for inter-rack communication, delivering 128 GB/s of bidirectional bandwidth per chip between racks. This direct-link design avoids additional switch hops for cross-rack traffic.

Bandwidth summary
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - Connectivity level
     - Bandwidth per chip
     - Link configuration
   * - Intra-server (within sled)
     - 256 GB/s
     - 4 × PCIe Gen6 x8 via intra-server switch
   * - Inter-server (within rack)
     - 320 GB/s
     - 5 × PCIe Gen6 x8 via inter-server switch
   * - Inter-rack
     - 128 GB/s
     - 2 × PCIe Gen6 x8 direct links

Routing and address-based switching
-----------------------------------

Unlike Trn1 and Trn2, where NeuronLink connections are point-to-point and require no intermediate routing, Trn3's PCIe switch fabric uses address-based routing to direct transactions to the correct destination chip. Each Trainium3 chip in the system is identified by a tuple of (rack, server, chip), and this identity is encoded in the upper bits of the PCIe address used for outbound transactions.
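Conceptually, the encoding resembles the sketch below. The helper and the bit positions are purely hypothetical, chosen only to illustrate the idea; the real address layout and switch configuration are managed by the Neuron Runtime.

.. code:: python

   # Hypothetical bit positions for packing a (rack, server, chip) identity
   # into the upper bits of a 64-bit PCIe address (illustration only).
   RACK_SHIFT, SERVER_SHIFT, CHIP_SHIFT = 56, 50, 46

   def encode_remote_address(rack: int, server: int, chip: int, offset: int) -> int:
       """Tag a local memory offset with the destination chip's identity."""
       return (rack << RACK_SHIFT) | (server << SERVER_SHIFT) | (chip << CHIP_SHIFT) | offset

   addr = encode_remote_address(rack=1, server=3, chip=2, offset=0x1000)
   print(hex(addr))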
The PCIe switches use BAR (Base Address Register) address matching to determine the correct output port for each transaction. This routing is transparent to ML workloads. The Neuron Runtime and compiler handle all address encoding and switch configuration automatically. From the developer's perspective, collective operations and direct memory access between chips work the same way as on previous Trainium generations. Semaphore-based synchronization ------------------------------- Trn3 uses hardware semaphores to synchronize data transfers across the switched fabric. When a chip writes data to a remote chip's HBM, a follow-up semaphore write signals completion to the receiving chip. The system guarantees that data and its associated semaphore always traverse the same physical path through the switch fabric, ensuring correct ordering without additional software synchronization overhead. ================================================ FILE: about-neuron/benchmarks/index.rst ================================================ .. _benchmark: .. meta:: :description: Explore AWS Neuron performance benchmarks for Inf1, Inf2, and Trn1 instances. Find detailed inference and training performance data across NLP, CV, and recommender models to optimize your machine learning workloads. :date-modified: 2025-10-03 Neuron performance ================== The Neuron performance pages provide comprehensive benchmarks and performance data for AWS Neuron SDK across different Trainium and Inferentia instance types. These benchmarks cover various open-source models for Natural Language Processing (NLP), Computer Vision (CV), and Recommender systems. Each benchmark includes detailed setup instructions and reproducible test configurations to help you evaluate performance for your specific use cases. Inference performance --------------------- .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: appnote-performance-benchmark :link-type: ref **Inf1 Inference Performance** ^^^ Comprehensive inference benchmarks for ``Inf1`` instances across NLP, CV, and recommender models .. grid-item-card:: :link: inf2-performance :link-type: ref **Inf2 Inference Performance** ^^^ Latest inference performance data for ``Inf2`` instances with improved throughput and latency metrics .. grid-item-card:: :link: trn1-inference-performance :link-type: ref **Trn1 Inference Performance** ^^^ Inference benchmarks for ``Trn1`` instances showcasing versatile training and inference capabilities Training performance -------------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: trn1-training-performance :link-type: ref **Trn1 Training Performance** ^^^ Training performance benchmarks for ``Trn1`` instances with distributed training metrics and scalability data .. toctree:: :maxdepth: 1 :hidden: inf1/index inf2/inf2-performance trn1/trn1-inference-performance trn1/trn1-training-performance ================================================ FILE: about-neuron/benchmarks/inf1/data.csv ================================================ Name,Model,Model details,Framework,Application Type,Run Mode,Inst. Type,Num. 
Cores,Batch Size,Avg Throughput (/sec),Max Throughput,Threads,Ops in Inferentia,Latency P50 (ms),Latency P90 (ms),Latency P95 (ms),Latency P99 (ms),Latency P100 (ms),Neuron Version,Application,Tutorial "YOLOv4-PT(fp32,b1,c4)",YOLO v4,fp32,PyTorch 1.13,Real Time,Data Parallel,inf1.2xlarge,4,1,180.2,,8,,40.1,,,52,,2.15.0,CV,:ref:`Evaluate YOLO v4 on Inferentia ` "Resnet50-PT(fp32,b5,c4)",Resnet-50,fp32,PyTorch 1.13,Batch,Data Parallel,inf1.xlarge,4,5,923,,4,,22,,,23,,2.15.0,CV,:ref:`Resnet50 model for Inferentia ` "Resnet50-TF(fp16,b5,c4)",Resnet-50,fp16,Tensorflow 1.15,Batch,Data Parallel,inf1.xlarge,4,10,2207,,8,,17.8,,,22.7,,2.12.0,CV,:ref:`ResNet-50 optimization example ` "OpenPose-TF(fp16,b1,c4)",OpenPose,fp16,Tensorflow 1.15,Real Time,Data Parallel,inf1.xlarge,4,1,57.5,,4,,60.3,,,67.4,,2.12.0,CV,:ref:`Running OpenPose on Inferentia ` "BERT-base-PT(fp32,b6,c4)",BERT base,"fp32, bert-base-cased-finetuned-mrpc, sequence-length=128",PyTorch 1.13,Batch,Data Parallel,inf1.xlarge,4,6,966,,4,,21,,,22,,2.15.0,NLP,:ref:`HuggingFace Pretrained BERT ` "BERT-base-PT(fp32,b1,c16)",BERT base,"fp32, bert-base-uncased, sequence-length=128",PyTorch 1.13,Real Time,Model Pipeline,inf1.6xlarge,16,1,1988.8,,12,,6,,,6.3,,2.15.0,NLP,:ref:`Using NeuronCore Pipeline ` "BERT-base-TF(fp32,b128,c16)",BERT base,"fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128",Tensorflow 2.8,Batch,Data Parallel,inf1.6xlarge,16,16,2114.8,,,,30.1,,,33,,2.15.0,NLP,:ref:`HuggingFace distilBERT with Tensorflow2 ` ================================================ FILE: about-neuron/benchmarks/inf1/index.rst ================================================ .. _appnote-performance-benchmark: Inf1 Inference Performance =========================== .. important:: The benchmark scripts linked on this page are provided for historical reference only and are not tested with recent versions of the Neuron SDK. They have been moved to the `archive folder `_. .. contents:: Table of contents :local: The following tables contain the reference inference performance for models in the tutorials. Follow the links on each row to replicate similar results in your own environment. Refer to :ref:`ec2-then-ec2-setenv` documentation to create a new environment based on the latest Neuron release. *Last update: September 16th, 2024* .. _NLP: Encoder Models -------------- .. tab-set:: .. tab-item:: Throughput optimized .. df-table:: :header-rows: 1 df = pd.read_csv('throughput_data_encoder.csv') df_prices = pd.read_csv('instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. Type') df['Cost per 1M inferences'] = ((1.0e6 / df['Avg Throughput (/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format) cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'Batch Size', 'Model details' ] df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']) int_cols = ['Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)'] df[int_cols] = df[int_cols].round(0).astype('int',copy=True) .. tab-item:: Latency optimized .. df-table:: :header-rows: 1 df = pd.read_csv('latency_data_encoder.csv') df_prices = pd.read_csv('instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. 
Type')
   df['Cost per 1M inferences'] = ((1.0e6 / df['Avg Throughput (/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format)
   cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'Batch Size', 'Model details' ]
   df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences'])
   int_cols = ['Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)']
   df[int_cols] = df[int_cols].round(0).astype('int',copy=True)

.. note:: Throughput and latency numbers in this table were computed using NeuronPerf_. To reproduce these results, install NeuronPerf and run the provided scripts.

.. _NeuronPerf: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuronperf/index.html

Convolutional Neural Networks (CNN) Models
------------------------------------------

.. df-table::
   :header-rows: 1

   df = pd.read_csv('throughput_data_cnn.csv')
   df_prices = pd.read_csv('instance_prices.csv')
   df = pd.merge(df,df_prices,on='Inst. Type').query('`Application`=="CV"')
   df['Cost per 1M inferences'] = ((1.0e6 / df['Avg Throughput (/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format)
   cols_to_show = ['Model', 'Tutorial', 'Framework', 'Inst. Type', 'Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'Batch Size', 'Model details' ]
   df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']).groupby('Model').head(2)
   int_cols = ['Avg Throughput (/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)']
   df[int_cols] = df[int_cols].round(0).astype('int',copy=True)

.. note:: Throughput and latency numbers in this table were generated using Neuron Tutorials.

.. note:: **Cost per 1M inferences** is calculated using US East (N. Virginia) RI-Effective hourly rate. **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference.

================================================
FILE: about-neuron/benchmarks/inf1/instance_prices.csv
================================================

Inst. Type,RI-Effective hourly rate
inf1.xlarge,0.110
inf1.2xlarge,0.174
inf1.6xlarge,0.567
inf1.24xlarge,2.269

================================================
FILE: about-neuron/benchmarks/inf1/latency_data_encoder.csv
================================================

Model,Scripts,Source,Framework,Inst. Type,Num Cores,Seq.
Length,Avg Throughput (/sec),Max Throughput,Threads,Latency P50 (ms),Latency P90 (ms),Latency P95 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,N Models,Workers per Model,Model details BERT base (bert-base-cased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,125.7,,8,7.9,,,8.0,Real Time,2.20.0,Data Parallel,1,1,1,"fp32, sequence-length=128" BERT base (bert-base-uncased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,284.7,,8,10.5,,,10.7,Real Time,2.20.0,Data Parallel,3,1,1,"fp32, sequence-length=128" DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,593.4,,8,10.0,,,10.7,Real Time,2.20.0,Data Parallel,5,1,1,"fp32, sequence-length=128" DistilBERT base (distilbert-base-uncased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,538.2,,8,11.1,,,11.5,Real Time,2.20.0,Data Parallel,6,1,1,"fp32, sequence-length=128" DistilRoBERTa base (distilroberta-base),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,417.0,,8,7.0,,,7.8,Real Time,2.20.0,Data Parallel,3,1,1,"fp32, sequence-length=128" ================================================ FILE: about-neuron/benchmarks/inf1/throughput_data_cnn.csv ================================================ Name,Model,Model details,Framework,Application Type,Run Mode,Inst. Type,Num. Cores,Batch Size,Avg Throughput (/sec),Max Throughput,Threads,Ops in Inferentia,Latency P50 (ms),Latency P90 (ms),Latency P95 (ms),Latency P99 (ms),Latency P100 (ms),Neuron Version,Application,Tutorial "YOLOv4-PT(fp32,b1,c4)",YOLO v4,fp32,PyTorch 1.13,Real Time,Data Parallel,inf1.2xlarge,4,1,180.3,,8,,40.0,,,50.8,,2.20.0,CV,:ref:`Evaluate YOLO v4 on Inferentia ` "Resnet50-PT(fp32,b5,c4)",Resnet-50,fp32,PyTorch 1.13,Batch,Data Parallel,inf1.xlarge,4,5,921.5,,4,,21.6,,,22.9,,2.20.0,CV,:ref:`Resnet50 model for Inferentia ` "Resnet50-TF(fp16,b5,c4)",Resnet-50,fp16,Tensorflow 1.15,Batch,Data Parallel,inf1.xlarge,4,10,2207,,8,,17.8,,,22.7,,2.12.0,CV,:ref:`ResNet-50 optimization example ` "OpenPose-TF(fp16,b1,c4)",OpenPose,fp16,Tensorflow 1.15,Real Time,Data Parallel,inf1.xlarge,4,1,57.5,,4,,60.3,,,67.4,,2.12.0,CV,:ref:`Running OpenPose on Inferentia ` ================================================ FILE: about-neuron/benchmarks/inf1/throughput_data_encoder.csv ================================================ Model,Scripts,Source,Framework,Inst. Type,Num Cores,Seq. 
BERT base (bert-base-cased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,1095.4,,8,58.3,,,65.0,Batch,2.20.0,Data Parallel,8,4,2,"fp32, sequence-length=128"
BERT base (bert-base-uncased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,1180.7,,8,40.6,,,45.0,Batch,2.20.0,Data Parallel,6,4,2,"fp32, sequence-length=128"
DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,1875.3,,8,33.7,,,54.1,Batch,2.20.0,Data Parallel,8,4,2,"fp32, sequence-length=128"
DistilBERT base (distilbert-base-uncased),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,1876.7,,8,33.7,,,53.2,Batch,2.20.0,Data Parallel,8,4,2,"fp32, sequence-length=128"
DistilRoBERTa base (distilroberta-base),:compile-pt:`Compile ` + :benchmark-pt:`Benchmark `,HuggingFace,PyTorch 1.13.1,inf1.xlarge,4,128,1512.9,,8,15.0,,,25.9,Batch,2.20.0,Data Parallel,6,4,1,"fp32, sequence-length=128"
BERT base,:ref:`HuggingFace Pretrained BERT `,,PyTorch 1.13,inf1.xlarge,,,1056,,,20,,,21,Batch,2.20.0,Data Parallel,4,,,"fp32, bert-base-cased-finetuned-mrpc, sequence-length=128"
BERT base,:ref:`Using NeuronCore Pipeline `,,PyTorch 1.13,inf1.6xlarge,,,2009.1,,,5.9,,,6.3,Real Time,2.20.0,Model Pipeline,1,,,"fp32, bert-base-uncased, sequence-length=128"
BERT base,:ref:`HuggingFace distilBERT with Tensorflow2 `,,Tensorflow 2.10,inf1.6xlarge,,,2123.4,,,30.0,,,32.2,Batch,2.20.0,Data Parallel,16,,,"fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128"

================================================
FILE: about-neuron/benchmarks/inf2/inf2-performance.rst
================================================
.. _inf2-performance:

Inf2 Inference Performance
==========================

.. important::
   The benchmark scripts linked on this page are provided for historical reference only and are not tested with recent versions of the Neuron SDK. They have been moved to the `archive folder `_.

.. contents:: Table of contents
   :local:
   :depth: 1

*Last update: Feb 26th, 2026*

.. _inf2_inference_perf:

Encoder Models
--------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_encoder.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (inference/second)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/second)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Sequence Length', 'Model Data Type',
                         'Compilation Autocast Data Type', 'OS Type']
         df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences'])
         df['Throughput (inference/second)'] = df['Throughput (inference/second)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_encoder.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (inference/second)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/second)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Sequence Length', 'Model Data Type',
                         'Compilation Autocast Data Type', 'OS Type']
         df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences'])
         df['Throughput (inference/second)'] = df['Throughput (inference/second)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

Encoder-Decoder Models
----------------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_encoder_decoder.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (tokens/second)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (tokens/second)', 'Latency per Token P50 (ms)',
                         'Latency per Token P99 (ms)', 'Cost per 1M inferences', 'Application Type',
                         'Neuron Version', 'Run Mode', 'TP Degree', 'DP Degree', 'Batch Size',
                         'Sequence Length', 'Input Length', 'Output Length', 'Model Data Type',
                         'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences'])
         df['Throughput (tokens/second)'] = df['Throughput (tokens/second)'].round(2).astype('float', copy=True)
         int_cols = ['Latency per Token P50 (ms)', 'Latency per Token P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         Only for Encoder-Decoder:

         **Throughput (tokens/second)** counts both input and output tokens.

         **Latency per Token** counts both input and output tokens.

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_encoder_decoder.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (tokens/second)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (tokens/second)', 'Latency per Token P50 (ms)',
                         'Latency per Token P99 (ms)', 'Cost per 1M inferences', 'Application Type',
                         'Neuron Version', 'Run Mode', 'TP Degree', 'DP Degree', 'Batch Size',
                         'Sequence Length', 'Input Length', 'Output Length', 'Model Data Type',
                         'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences'])
         df['Throughput (tokens/second)'] = df['Throughput (tokens/second)'].round(2).astype('float', copy=True)
         int_cols = ['Latency per Token P50 (ms)', 'Latency per Token P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         **Throughput (tokens/second)** counts both input and output tokens.

         **Latency per Token** counts both input and output tokens.
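Because these per-token metrics count input as well as output tokens, a batch-size-1 row can be sanity-checked with a couple of lines of arithmetic. A minimal sketch (the helper is ours; the t5-3b figures come from the tables above, so small rounding differences are expected):

.. code-block:: python

   # Per-token accounting as described in the notes above:
   # both input and output tokens are counted.
   def per_token_latency_ms(total_latency_ms, input_len, output_len):
       return total_latency_ms / (input_len + output_len)

   # t5-3b row: 128 input + 84 output tokens at ~9.25 ms per token
   lat = per_token_latency_ms(9.25 * (128 + 84), 128, 84)
   print(round(lat, 2), 'ms/token,', round(1.0e3 / lat, 2), 'tokens/sec')
   # -> 9.25 ms/token, 108.11 tokens/sec (the table reports 108.18 at batch size 1)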

Vision Transformers Models
--------------------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_vision_transformers.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_vision_transformers.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

Convolutional Neural Networks (CNN) Models
------------------------------------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_vision_cnn.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_vision_cnn.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

Stable Diffusion Models
-----------------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_vision_sd.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         **Cost per 1M images** is calculated using RI-Effective hourly rate.

         **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference.

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_vision_sd.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         **Cost per 1M images** is calculated using RI-Effective hourly rate.

         **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference.
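For the batch-size-1 (**Real Time**) rows in these tables, throughput and P50 latency are near reciprocals of each other, which makes a quick consistency check possible. A sketch against the Stable Diffusion 1.5 entry, using numbers from the table above:

.. code-block:: python

   # For batch size 1, throughput (images/sec) ~= 1000 / P50 latency (ms).
   # Stable Diffusion 1.5 row: 0.494 images/sec at 2023.741 ms P50.
   print(round(1.0e3 / 2023.741, 3))  # -> 0.494, matching the reported throughput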

Diffusion Transformer Models
----------------------------

.. tab-set::

   .. tab-item:: Throughput optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('throughput_data_vision_dit.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         **Cost per 1M images** is calculated using RI-Effective hourly rate.

         **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference.

   .. tab-item:: Latency optimized

      .. df-table::
         :header-rows: 1

         df = pd.read_csv('latency_data_vision_dit.csv')
         df_prices = pd.read_csv('inf2_instance_prices.csv')
         df = pd.merge(df, df_prices, on='Inst. Type')
         df['Cost per 1M images'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3)).map('${:,.3f}'.format)
         cols_to_show = ['Model', 'Image Size', 'Scripts', 'Framework', 'Inst. Type', 'Task',
                         'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)',
                         'Cost per 1M images', 'Application Type', 'Neuron Version', 'Run Mode',
                         'Batch Size', 'Model Data Type', 'Compilation Autocast Data Type']
         df = df[cols_to_show].sort_values(['Model', 'Image Size', 'Cost per 1M images'])
         df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float', copy=True)
         int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)']
         df[int_cols] = df[int_cols].round(2).astype('float', copy=True)

      .. note::
         **Cost per 1M images** is calculated using RI-Effective hourly rate.

         **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference.

.. note::
   See :ref:`neuron_hw_glossary` for abbreviations and terms.
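One caveat worth noting if you reuse the table-building pattern on this page: the cost column is formatted to a ``'$…'`` string *before* ``sort_values`` runs, so that sort key is lexicographic rather than numeric. A small standalone illustration, not part of the page's own code:

.. code-block:: python

   import pandas as pd

   # '${:,.3f}'-formatted strings sort lexicographically, not numerically:
   s = pd.Series([2.0, 10.0]).map('${:,.3f}'.format)
   print(list(s.sort_values()))  # ['$10.000', '$2.000'] -- '1' < '2' as characters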

================================================
FILE: about-neuron/benchmarks/inf2/inf2_instance_prices.csv
================================================
Inst. Type,RI-Effective hourly rate
Inf2.xlarge,0.328
Inf2.48xlarge,5.608
Inf2.24xlarge,2.804
Inf2.8xlarge,0.850

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Output Token Throughput (tokens/sec),TTFT Latency P50 (ms),TTFT Latency P99 (ms),TPOT Latency P50 (ms),TPOT Latency P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type,Weight Storage Data Type
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,144.98,29.47,42.05,7.41,7.68,Real Time,2.18.1,Tensor Parallel,24,1,8192,128,8064,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,133.63,209.37,232.02,7.47,7.57,Real Time,2.18.1,Tensor Parallel,24,1,8192,4096,4096,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,161.67,25,25.88,6.42,6.58,Real Time,2.18.1,Tensor Parallel,24,1,4096,128,3968,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,153.58,101.81,110.6,6.5,6.6,Real Time,2.18.1,Tensor Parallel,24,1,4096,2048,2048,FP16,Matmult-BF16,int8
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.84,745.49,749.48,34.67,35.06,Real Time,2.18.1,Tensor Parallel,24,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.81,312.86,322.56,33.81,34.13,Real Time,2.18.1,Tensor Parallel,24,1,3072,1024,2048,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.16,310.18,315.23,33.14,34.29,Real Time,2.18.1,Tensor Parallel,24,1,2048,1024,1024,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.82,80,100.47,32.47,33.03,Real Time,2.18.1,Tensor Parallel,24,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.9,99.37,142.62,32.48,32.86,Real Time,2.18.1,Tensor Parallel,24,1,512,256,256,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,31.28,77.81,78.52,32.2,33.02,Real Time,2.18.1,Tensor Parallel,24,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,156.1281689,27.63772011,33.7741375,6.46972656,7.07960129,Real Time,2.18.0,Tensor Parallel,24,1,4096,128,3968,FP16,Matmult-BF16,bf16
Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,145.1665497,29.20985222,33.39338303,7.34019279,7.80153275,Real Time,2.18.0,Tensor Parallel,24,1,8192,128,8064,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,112.520024,25.85077286,26.89838409,9.16552544,9.33074951,Real Time,2.18.0,Tensor Parallel,24,1,4096,128,3968,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,97.41527724,333.7800503,340.9907818,10.17355919,10.37788391,Real Time,2.18.0,Tensor Parallel,24,1,8192,4096,4096,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,73.16747525,994.1797257,999.7954369,13.49759102,13.97609711,Real Time,2.18.0,Tensor Parallel,24,1,16384,8192,8192,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.06356,76.59531,77.12364,32.89557,33.42032,Real Time,2.18.0,Tensor Parallel,24,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.92419,96.4396,98.47379,33.13422,33.45966,Real Time,2.18.0,Tensor Parallel,24,1,512,256,256,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.07017,76.33042,86.52544,33.15115,34.0786,Real Time,2.18.0,Tensor Parallel,24,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.426,277.01592,280.12586,33.73241,34.01256,Real Time,2.18.0,Tensor Parallel,24,1,2048,1024,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.91353,275.96617,284.77097,34.81936,35.43973,Real Time,2.18.0,Tensor Parallel,24,1,3072,1024,2048,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.32725,810.43696,814.87799,34.90329,35.14242,Real Time,2.18.0,Tensor Parallel,24,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,149.7363908,27.34160423,29.20722961,6.86240196,7.07960129,Real Time,2.18.0,Tensor Parallel,24,1,4096,128,3968,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,81.7034129,557.9631329,562.8581047,7.86566734,11.64746284,Real Time,2.18.0,Tensor Parallel,24,1,8192,4096,4096,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,95.99325977,539.5913124,557.1010113,10.32972336,10.61367989,Real Time,2.18.0,Tensor Parallel,24,1,16384,8192,8192,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,112.7050057,27.0178318,33.24627876,9.12380219,9.38177109,Real Time,2.18.0,Tensor Parallel,24,1,4096,128,3968,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,97.52121418,338.6683464,340.4603005,10.15138626,10.55026054,Real Time,2.18.0,Tensor Parallel,24,1,8192,4096,4096,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,73.67826681,989.4962311,1000.655413,13.43631744,13.85569572,Real Time,2.18.0,Tensor Parallel,24,1,16384,8192,8192,FP16,Matmult-BF16,bf16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_encoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (inference/second),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Sequence Length,Model Data Type,Compilation Autocast Data Type,OS Type
albert-base-v2,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),2119.78480993,0.93722343,1.00183487,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
bert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),1998.20950133,0.99897385,1.04045868,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
bert-large-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.7,Inf2.xlarge,Raw Output (AutoModel),738.64502335,2.69365311,2.77733803,Real Time,2.25.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
distilbert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),3401.96550351,0.57864189,0.67734718,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
google/electra-base-discriminator,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),2020.45540243,0.9958744,1.04618073,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),1989.26102482,0.99945068,1.09100342,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
roberta-large,:benchmark-pt:`Benchmark `,PyTorch 2.8,Inf2.xlarge,Raw Output (AutoModel),738.88441011,2.69317627,2.77304649,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
xlm-roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.48xlarge,Raw Output (AutoModelForMaskedLM),48.80198341,40.66610336,51.05760336,Real Time,2.22.0,Data Parallel,1,128,FP32,Matmult-BF16,U22

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_encoder_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (tokens/second),Latency per Token P50 (ms),Latency per Token P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,DP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type
t5-3b,`Tutorial `_,NeuronX Distributed,Inf2.24xlarge,Text Generation,108.18,9.25,9.26,Real Time,2.18.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16
google/flan-t5-xl,`Tutorial `_,NeuronX Distributed,Inf2.24xlarge,Text Generation,117.6,8.5,8.53,Real Time,2.18.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_vision.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
deepmind/multimodal-perceiver,16x224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Multimodal Autoencoding,0.83,1250,1271,Real Time,2.18.0,Data Parallel,1,FP32,None
deepmind/vision-perceiver-learned,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,99.6,18.6,18.7,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-fourier,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,67.9,29.5,29.68,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-conv,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,126.5,14.14,14.2,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
google/vit-base-patch16-224,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,709.468,1.406,1.431,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
openai/clip-vit-base-patch32,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,163.444,6.113,6.143,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
openai/clip-vit-large-patch14,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,61.812,16.172,16.216,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
resnet18,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,1385.04,0.72,0.75,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
resnet34,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,1187.64,0.83,0.88,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
resnet50,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,1044.93,0.95,0.98,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
resnet101,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,882.61,1.13,1.15,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
resnet152,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,736.91,1.35,1.39,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 1.5,512x512,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.421,2369.6,2406.8,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2.1,512x512,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.549,1794.5,2103.7,Real Time,2.17.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion 2.1,768x768,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.188,5306.7,5368.6,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2 Inpainting,936x624,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.15,6701.4,6737.4,Real Time,2.17.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion XL Base,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.073,13431.7,15739.0,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion XL Base & Refiner,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.8xlarge,Image Generation,0.078,12651.9,15053.9,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
UNet,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Segmentation,420.16,2.37,2.41,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
vgg11,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,524.10,1.90,1.96,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16
vgg16,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,435.54,2.29,2.33,Real Time,2.14.0,Data Parallel,1,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_vision_cnn.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
resnet18,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,1669.796,0.596,0.613,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
resnet34,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,1394.211,0.718,0.726,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
resnet50,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,1218.875,0.83,0.846,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
resnet101,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,994.691,1.007,1.024,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
resnet152,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,837.784,1.185,1.219,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
UNet,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Segmentation,447.094,2.232,2.253,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
vgg11,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,629.189,1.59,1.605,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
vgg16,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,508.665,1.956,1.995,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_vision_dit.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
PixArt Alpha,256x256,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,1.975,502.587,537.258,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Alpha,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,0.565,1769.756,1775.697,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Sigma,256x256,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,1.86,540.832,548.41,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Sigma,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,0.543,1841.882,1850.683,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_vision_sd.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
Stable Diffusion 1.5,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.494,2023.741,2031.705,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2.1,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.596,1679.805,1685.442,Real Time,2.21.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion 2.1,768x768,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.187,5337.509,5357.361,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2 Inpainting,936x624,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.133,7546.004,7550.984,Real Time,2.21.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion XL Base,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.083,12048.659,12102.431,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion XL Base & Refiner,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.8xlarge,Image Generation,0.095,10546.45,10704.566,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/latency_data_vision_transformers.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
deepmind/multimodal-perceiver,16x224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Multimodal Autoencoding,0.853,1170.045,1232.056,Real Time,2.21.0,Data Parallel,1,FP32,None
deepmind/vision-perceiver-learned,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,99.6,18.6,18.7,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-fourier,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,67.9,29.5,29.68,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-conv,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,126.5,14.14,14.2,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
google/vit-base-patch16-224,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,746.139,1.322,1.378,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
openai/clip-vit-base-patch32,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,161.047,6.213,6.246,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
openai/clip-vit-large-patch14,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,73.261,13.643,13.685,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Output Token Throughput (tokens/sec),TTFT Latency P50 (ms),TTFT Latency P99 (ms),TPOT Latency P50 (ms),TPOT Latency P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type,Weight Storage Data Type
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,649.17,68.95,99.28,15.22,15.48,Batch,2.18.1,Tensor Parallel,24,8,8192,128,8064,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,521.96,1992.59,2016.73,15.31,15.64,Batch,2.18.1,Tensor Parallel,24,8,8192,4096,4096,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,859.09,66.02,75.73,10.45,10.76,Batch,2.18.1,Tensor Parallel,24,8,4096,128,3968,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,759.15,823.53,832.84,10.5,11.02,Batch,2.18.1,Tensor Parallel,24,8,4096,2048,2048,FP16,Matmult-BF16,int8
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.84,745.49,749.48,34.67,35.06,Batch,2.18.1,Tensor Parallel,24,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.81,312.86,322.56,33.81,34.13,Batch,2.18.1,Tensor Parallel,24,1,3072,1024,2048,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.16,310.18,315.23,33.14,34.29,Batch,2.18.1,Tensor Parallel,24,1,2048,1024,1024,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.82,80,100.47,32.47,33.03,Batch,2.18.1,Tensor Parallel,24,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.9,99.37,142.62,32.48,32.86,Batch,2.18.1,Tensor Parallel,24,1,512,256,256,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,31.28,77.81,78.52,32.2,33.02,Batch,2.18.1,Tensor Parallel,24,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,725.82805,77.36206,87.27574,12.10523,13.05699,Batch,2.18.0,Tensor Parallel,24,8,4096,128,3968,FP16,Matmult-BF16,bf16
Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,577.97078,80.11794,89.68878,16.39295,17.81178,Batch,2.18.0,Tensor Parallel,24,8,8192,128,8064,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,589.88712,108.80947,113.89017,14.89663,15.79142,Batch,2.18.0,Tensor Parallel,24,8,4096,128,3968,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,351.75817,7083.72855,7158.32424,20.9856,21.80099,Batch,2.18.0,Tensor Parallel,24,8,8192,4096,4096,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,178.56973,5141.32094,5160.92515,21.70897,22.74466,Batch,2.18.0,Tensor Parallel,24,4,16384,8192,8192,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.06356,76.59531,77.12364,32.89557,33.42032,Batch,2.18.0,Tensor Parallel,24,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.92419,96.4396,98.47379,33.13422,33.45966,Batch,2.18.0,Tensor Parallel,24,1,512,256,256,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,30.07017,76.33042,86.52544,33.15115,34.0786,Batch,2.18.0,Tensor Parallel,24,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,29.426,277.01592,280.12586,33.73241,34.01256,Batch,2.18.0,Tensor Parallel,24,1,2048,1024,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.91353,275.96617,284.77097,34.81936,35.43973,Batch,2.18.0,Tensor Parallel,24,1,3072,1024,2048,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,28.32725,810.43696,814.87799,34.90329,35.14242,Batch,2.18.0,Tensor Parallel,24,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,761.88605,77.62027,86.62724,11.63864,12.49599,Batch,2.18.0,Tensor Parallel,24,8,4096,128,3968,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,450.37555,4740.11564,4783.75316,16.54649,17.52925,Batch,2.18.0,Tensor Parallel,24,8,8192,4096,4096,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,411.04655,11085.12306,11125.86117,18.01157,19.9585,Batch,2.18.0,Tensor Parallel,24,8,16384,8192,8192,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,546.51472,115.81421,121.49906,15.87224,17.21263,Batch,2.18.0,Tensor Parallel,24,8,4096,128,3968,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,333.24073,7115.97776,7231.01234,22.26758,23.81206,Batch,2.18.0,Tensor Parallel,24,8,8192,4096,4096,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,Inf2.48xlarge,Text Generation,178.79017,5136.61623,5192.58666,21.6732,22.73154,Batch,2.18.0,Tensor Parallel,24,4,16384,8192,8192,FP16,Matmult-BF16,bf16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_encoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (inference/second),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Sequence Length,Model Data Type,Compilation Autocast Data Type,OS Type
albert-base-v2,:benchmark-pt:`Benchmark `,PyTorch 2.7,Inf2.xlarge,Raw Output (AutoModel),3147.09984049,5.0675869,5.27883291,Batch,2.25.0,Data Parallel,8,128,FP32,Matmult-BF16,U22
bert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,Inf2.xlarge,Raw Output (AutoModel),2674.18956433,5.97381591,6.17100715,Batch,2.27.0,Data Parallel,8,128,FP32,Matmult-BF16,U22
bert-large-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Raw Output (AutoModel),950.0496231,8.41140747,8.84652853,Batch,2.21.0,Data Parallel,4,128,FP32,Matmult-BF16,U22
distilbert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,Inf2.xlarge,Raw Output (AutoModel),5307.87660777,6.01053237,6.23083114,Batch,2.27.0,Data Parallel,16,128,FP32,Matmult-BF16,U22
google/electra-base-discriminator,:benchmark-pt:`Benchmark `,PyTorch 2.7,Inf2.xlarge,Raw Output (AutoModel),2889.75325068,11.02411747,11.97555304,Batch,2.25.0,Data Parallel,16,128,FP32,Matmult-BF16,U22
roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.7,Inf2.xlarge,Raw Output (AutoModel),2920.37954741,5.42390347,5.82957506,Batch,2.25.0,Data Parallel,8,128,FP32,Matmult-BF16,U22
roberta-large,:benchmark-pt:`Benchmark `,PyTorch 2.7,Inf2.xlarge,Raw Output (AutoModel),962.70185508,8.31007957,8.60977411,Batch,2.25.0,Data Parallel,4,128,FP32,Matmult-BF16,U22
xlm-roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.48xlarge,Raw Output (AutoModelForMaskedLM),51.13695938,625.66077709,694.93403673,Batch,2.22.0,Data Parallel,16,128,FP32,Matmult-BF16,U22

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_encoder_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (tokens/second),Latency per Token P50 (ms),Latency per Token P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,DP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type
t5-3b,`Tutorial `_,NeuronX Distributed,Inf2.24xlarge,Text Generation,111.92,8.97,8.98,Batch,2.17.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16
google/flan-t5-xl,`Tutorial `_,NeuronX Distributed,Inf2.24xlarge,Text Generation,117.61,8.51,8.53,Batch,2.17.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_vision.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
deepmind/multimodal-perceiver,16x224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Multimodal Autoencoding,0.83,1250,1271,Real Time,2.18.0,Data Parallel,1,FP32,None
deepmind/vision-perceiver-learned,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,99.6,18.6,18.7,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-fourier,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,67.9,29.5,29.68,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-conv,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,126.5,14.14,14.2,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
google/vit-base-patch16-224,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,1632.359,4.716,5.902,Batch,2.14.0,Data Parallel,2,FP32,Matmult-BF16
openai/clip-vit-base-patch32,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,5178.833,48.973,57.002,Batch,2.14.0,Data Parallel,64,FP32,Matmult-BF16
openai/clip-vit-large-patch14,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,200.997,78.331,92.452,Batch,2.14.0,Data Parallel,4,FP32,Matmult-BF16
resnet18,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,6635.04,4.80,4.88,Batch,2.14.0,Data Parallel,8,FP32,Matmult-BF16
resnet34,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,4848.72,6.56,6.66,Batch,2.14.0,Data Parallel,8,FP32,Matmult-BF16
resnet50,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,4269.12,7.49,7.55,Batch,2.14.0,Data Parallel,8,FP32,Matmult-BF16
resnet101,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,3066.24,83.38,83.56,Batch,2.14.0,Data Parallel,64,FP32,Matmult-BF16
resnet152,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,2323.20,110.06,110.21,Batch,2.14.0,Data Parallel,64,FP32,Matmult-BF16
Stable Diffusion 1.5,512x512,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.421,2369.6,2406.8,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2.1,512x512,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.549,1794.5,2103.7,Real Time,2.17.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion 2.1,768x768,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.188,5306.7,5368.6,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2 Inpainting,936x624,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.15,6701.4,6737.4,Real Time,2.17.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion XL Base,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Generation,0.073,13431.7,15739.0,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion XL Base & Refiner,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.8xlarge,Image Generation,0.078,12651.9,15053.9,Real Time,2.17.0,Data Parallel,1,FP32,Matmult-BF16
UNet,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Segmentation,866.96,18.37,18.86,Batch,2.14.0,Data Parallel,4,FP32,Matmult-BF16
vgg11,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,3955.20,64.15,64.24,Batch,2.14.0,Data Parallel,64,FP32,Matmult-BF16
vgg16,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,1964.16,16.27,16.35,Batch,2.14.0,Data Parallel,8,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_vision_cnn.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
resnet18,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,6949.174,4.587,4.659,Batch,2.21.0,Data Parallel,8,FP32,Matmult-BF16
resnet34,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,5158.607,6.18,6.251,Batch,2.21.0,Data Parallel,8,FP32,Matmult-BF16
resnet50,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,4393.304,7.283,7.331,Batch,2.21.0,Data Parallel,8,FP32,Matmult-BF16
resnet101,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,3164.991,80.818,80.938,Batch,2.21.0,Data Parallel,64,FP32,Matmult-BF16
resnet152,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,2449.875,104.406,104.531,Batch,2.21.0,Data Parallel,64,FP32,Matmult-BF16
UNet,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Segmentation,1010.803,15.818,15.875,Batch,2.21.0,Data Parallel,4,FP32,Matmult-BF16
vgg11,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,4734.402,54.044,54.09,Batch,2.21.0,Data Parallel,64,FP32,Matmult-BF16
vgg16,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,2161.392,14.77,14.832,Batch,2.21.0,Data Parallel,8,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_vision_dit.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
PixArt Alpha,256x256,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,1.975,502.587,537.258,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Alpha,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,0.565,1769.756,1775.697,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Sigma,256x256,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,1.86,540.832,548.41,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16
PixArt Sigma,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.1,Inf2.xlarge,Image Generation,0.543,1841.882,1850.683,Real Time,2.20,Data Parallel,1,"FP32, BF16",Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_vision_sd.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
Stable Diffusion 1.5,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.494,2023.741,2031.705,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2.1,512x512,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.596,1679.805,1685.442,Real Time,2.21.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion 2.1,768x768,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.187,5337.509,5357.361,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion 2 Inpainting,936x624,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.133,7546.004,7550.984,Real Time,2.21.0,Data Parallel,1,"FP32, BF16",Matmult-BF16
Stable Diffusion XL Base,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Generation,0.083,12048.659,12102.431,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16
Stable Diffusion XL Base & Refiner,1024x1024,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.8xlarge,Image Generation,0.095,10546.45,10704.566,Real Time,2.21.0,Data Parallel,1,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/inf2/throughput_data_vision_transformers.csv
================================================
Model,Image Size,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Model Data Type,Compilation Autocast Data Type
deepmind/multimodal-perceiver,16x224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Multimodal Autoencoding,0.853,1170.045,1232.056,Real Time,2.21.0,Data Parallel,1,FP32,None
deepmind/vision-perceiver-learned,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,99.6,18.6,18.7,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-fourier,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,67.9,29.5,29.68,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
deepmind/vision-perceiver-conv,224x224,:benchmark-pt:`Benchmark `,PyTorch 1.13.1,Inf2.xlarge,Image Classification,126.5,14.14,14.2,Real Time,2.18.0,Data Parallel,1,FP32,Matmult-BF16
google/vit-base-patch16-224,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,1955.406,4.087,4.125,Batch,2.21.0,Data Parallel,2,FP32,Matmult-BF16
openai/clip-vit-base-patch32,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,6509.83,135.806,136.003,Batch,2.21.0,Data Parallel,64,FP32,Matmult-BF16
openai/clip-vit-large-patch14,224x224,:benchmark-pt:`Benchmark `,PyTorch 2.5,Inf2.xlarge,Image Classification,285.938,113.117,115.940,Batch,2.21.0,Data Parallel,8,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/trn1/latency_data_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Output Token Throughput (tokens/sec),TTFT Latency P50 (ms),TTFT Latency P99 (ms),TPOT Latency P50 (ms),TPOT Latency P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type,Weight Storage Data Type
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,157.25202,17.09,21.62,7.03,7.16,Real Time,2.18.1,Tensor Parallel,32,1,8192,128,8064,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,140.50031,153.02,159.13,7.04,7.13,Real Time,2.18.1,Tensor Parallel,32,1,8192,4096,4096,FP16,Matmult-BF16,int8
Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,178.18923,14.75,22.94,5.86,6,Real Time,2.18.1,Tensor Parallel,32,1,4096,128,3968,FP16,Matmult-BF16,int8
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,37.70379,547,553.89,26.2,26.79,Real Time,2.18.1,Tensor Parallel,32,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,40.63808,53.2,59.5,24.48,26.17,Real Time,2.18.1,Tensor Parallel,32,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,40.80995,52.53,52.79,26.48,24.22,Real Time,2.18.1,Tensor Parallel,32,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,161.7081305,13.32402229,14.1210556,6.69956207,6.84595108,Real Time,2.18.0,Tensor Parallel,32,1,8192,128,8064,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,60.43330245,864.1381264,865.9124374,9.84406471,10.14947891,Real Time,2.18.0,Tensor Parallel,32,1,8192,4096,4096,FP16,Matmult-BF16,bf16
Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,31.3990051,2367.928505,2369.139671,13.40842247,15.76948166,Real Time,2.18.0,Tensor Parallel,32,1,16384,8192,8192,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,39.28574,53.91026,54.9469,25.18129,26.58272,Real Time,2.18.0,Tensor Parallel,32,1,256,128,128,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,39.17668,81.882,98.77896,25.26712,25.7585,Real Time,2.18.0,Tensor Parallel,32,1,512,256,256,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,39.16379,57.75213,64.75568,25.44856,26.1333,Real Time,2.18.0,Tensor Parallel,32,1,1152,128,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,38.09518,232.47981,239.02893,26.03793,26.17574,Real Time,2.18.0,Tensor Parallel,32,1,2048,1024,1024,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,37.70947,236.78207,241.14895,26.62468,27.02999,Real Time,2.18.0,Tensor Parallel,32,1,3072,1024,2048,FP16,Matmult-BF16,bf16
Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,36.78021,690.95588,695.91761,26.85046,27.04263,Real Time,2.18.0,Tensor Parallel,32,1,4096,2048,2048,FP16,Matmult-BF16,bf16
Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,49.55890938,1322.874308,1325.857162,9.89246368,10.18333435,Real Time,2.18.0,Tensor Parallel,32,1,16384,8192,8192,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,60.21552741,868.635416,870.9816933,9.86456871,10.24436951,Real Time,2.18.0,Tensor Parallel,32,1,8192,4096,4096,FP16,Matmult-BF16,bf16
CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,31.37781421,2372.928381,2375.921965,13.3998394,13.79013062,Real Time,2.18.0,Tensor Parallel,32,1,16384,8192,8192,FP16,Matmult-BF16,bf16

================================================
FILE: about-neuron/benchmarks/trn1/latency_data_encoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Sequence Length,Model Data Type,Compilation Autocast Data Type,OS Type
albert-base-v2,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),2321.97758889,0.85997581,0.9086132,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
bert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.8,trn1.2xlarge,Raw Output (AutoModel),2085.45272427,0.94294548,1.02853775,Real Time,2.26.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
bert-large-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),747.48212826,2.66885757,2.73442268,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
distilbert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),3672.38478861,0.54264069,0.58531761,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
google/electra-base-discriminator,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),2127.07474023,0.93317032,0.9958744,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),2094.37288172,0.95796585,1.00588799,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
roberta-large,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),747.58300171,2.66981125,2.73323059,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22
xlm-roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.32xlarge,Raw Output (AutoModelForMaskedLM),46.89836990,42.62268543,44.11746978,Real Time,2.27.0,Data Parallel,1,128,FP32,Matmult-BF16,U22

================================================
FILE: about-neuron/benchmarks/trn1/latency_data_encoder_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Throughput (tokens/second),Latency per Token P50 (ms),Latency per Token P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,DP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type
t5-3b,`Tutorial `_,NeuronX Distributed,trn1.32xlarge,Text Generation,110.23,9.07,9.12,Real Time,2.18.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16
google/flan-t5-xl,`Tutorial `_,NeuronX Distributed,trn1.32xlarge,Text Generation,120.29,8.31,8.34,Real Time,2.18.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16

================================================
FILE: about-neuron/benchmarks/trn1/throughput_data_decoder.csv
================================================
Model,Scripts,Framework,Inst. Type,Task,Output Token Throughput (tokens/sec),TTFT Latency P50 (ms),TTFT Latency P99 (ms),TPOT Latency P50 (ms),TPOT Latency P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type,Weight Storage Data Type
Type,Task,Output Token Throughput (tokens/sec),TTFT Latency P50 (ms),TTFT Latency P99 (ms),TPOT Latency P50 (ms),TPOT Latency P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type,Weight Storage Data Type Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,933.50053,55.16,61.47,9.95,10.1,Batch,2.18.1,Tensor Parallel,32,8,8192,128,8064,FP16,Matmult-BF16,int8 Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,770.16291,1265.95,1292.94,10.04,10.33,Batch,2.18.1,Tensor Parallel,32,8,8192,4096,4096,FP16,Matmult-BF16,int8 Llama-3-8B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,1142.69582,49.05,52.79,7.65,7.94,Batch,2.18.1,Tensor Parallel,32,8,4096,128,3968,FP16,Matmult-BF16,int8 Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,120.3614,1661.12,1672.71,32.33,33.27,Batch,2.18.1,Tensor Parallel,32,4,4096,2048,2048,FP16,Matmult-BF16,bf16 Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,140.51039,129.86,132.03,28.38,29.11,Batch,2.18.1,Tensor Parallel,32,4,1152,128,1024,FP16,Matmult-BF16,bf16 Llama-3-70B,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,138.01357,130.37,130.48,28.08,28.53,Batch,2.18.1,Tensor Parallel,32,4,256,128,128,FP16,Matmult-BF16,bf16 Llama-2-7b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,917.2452652,66.4024353,70.63961029,10.09511948,10.46204567,Batch,2.18.0,Tensor Parallel,32,8,8192,128,8064,FP16,Matmult-BF16,bf16 Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,371.7031,6668.70475,6689.8005,19.85741,21.0557,Batch,2.18.0,Tensor Parallel,32,8,8192,4096,4096,FP16,Matmult-BF16,bf16 Llama-2-13b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,184.28337,4628.44729,4635.24675,21.09194,22.3856,Batch,2.18.0,Tensor Parallel,32,4,16384,8192,8192,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,141.45357,156.84581,158.41317,26.72362,30.16973,Batch,2.18.0,Tensor Parallel,32,4,256,128,128,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,143.42503,270.15853,270.55573,26.9084,27.90999,Batch,2.18.0,Tensor Parallel,32,4,512,256,256,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,145.12799,156.68869,161.41367,27.21453,30.60174,Batch,2.18.0,Tensor Parallel,32,4,1152,128,1024,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,133.25056,1478.64008,1479.77638,28.55039,29.49882,Batch,2.18.0,Tensor Parallel,32,4,2048,1024,1024,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,129.27628,1478.84846,1482.93161,31.67439,32.01842,Batch,2.18.0,Tensor Parallel,32,4,3072,1024,2048,FP16,Matmult-BF16,bf16 Llama-2-70b,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,120.62953,2722.03422,2730.95036,31.78978,33.2315,Batch,2.18.0,Tensor Parallel,32,4,4096,2048,2048,FP16,Matmult-BF16,bf16 Mistral-7B-Instruct-v0.2,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,484.5773,8614.85291,8630.24068,15.43713,15.9421,Batch,2.18.0,Tensor 
Parallel,32,8,16384,8192,8192,FP16,Matmult-BF16,bf16 CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,370.97736,6625.1595,6628.26467,19.91653,20.94936,Batch,2.18.0,Tensor Parallel,32,8,8192,4096,4096,FP16,Matmult-BF16,bf16 CodeLlama-13b-hf,:llama-sample:`Sample `,Transformers NeuronX,trn1.32xlarge,Text Generation,184.17898,4626.17469,4630.66864,21.09528,22.16578,Batch,2.18.0,Tensor Parallel,32,4,16384,8192,8192,FP16,Matmult-BF16,bf16 ================================================ FILE: about-neuron/benchmarks/trn1/throughput_data_encoder.csv ================================================ Model,Scripts,Framework,Inst. Type,Task,Throughput (inference/sec),Latency P50 (ms),Latency P99 (ms),Application Type,Neuron Version,Run Mode,Batch Size,Sequence Length,Model Data Type,Compilation Autocast Data Type,OS Type albert-base-v2,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),3442.53392946,9.28854942,9.35173273,Batch,2.27.0,Data Parallel,16,128,FP32,Matmult-BF16,U22 bert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),3421.56625089,9.34481621,9.41992044,Batch,2.27.0,Data Parallel,16,128,FP32,Matmult-BF16,U22 bert-large-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),1104.43610458,7.24101067,7.29799271,Batch,2.27.0,Data Parallel,4,128,FP32,Matmult-BF16,U22 distilbert-base-uncased,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),6369.44180331,5.00988960,5.09214401,Batch,2.28.0,Data Parallel,16,128,FP32,Matmult-BF16,U22 google/electra-base-discriminator,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),3425.55803570,9.32765007,9.45640087,Batch,2.28.0,Data Parallel,16,128,FP32,Matmult-BF16,U22 roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),3378.10764201,9.46044921,9.53317165,Batch,2.28.0,Data Parallel,16,128,FP32,Matmult-BF16,U22 roberta-large,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.2xlarge,Raw Output (AutoModel),1123.90475943,14.23048973,14.30106163,Batch,2.27.0,Data Parallel,8,128,FP32,Matmult-BF16,U22 xlm-roberta-base,:benchmark-pt:`Benchmark `,PyTorch 2.9,trn1.32xlarge,Raw Output (AutoModelForMaskedLM),46.68898543,342.50581264,350.86465597,Batch,2.27.0,Data Parallel,8,128,FP32,Matmult-BF16,U22 ================================================ FILE: about-neuron/benchmarks/trn1/throughput_data_encoder_decoder.csv ================================================ Model,Scripts,Framework,Inst. Type,Task,Throughput (tokens/second),Latency per Token P50 (ms),Latency per Token P99 (ms),Application Type,Neuron Version,Run Mode,TP Degree,DP Degree,Batch Size,Sequence Length,Input Length,Output Length,Model Data Type,Compilation Autocast Data Type t5-3b,`Tutorial `_,NeuronX Distributed,trn1.32xlarge,Text Generation,116.29,8.58,8.66,Batch,2.17.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16 google/flan-t5-xl,`Tutorial `_,NeuronX Distributed,trn1.32xlarge,Text Generation,122.52,8.16,8.19,Batch,2.17.0,Tensor Parallel,8,1,1,128,128,84,FP32,Matmult-BF16 ================================================ FILE: about-neuron/benchmarks/trn1/training_data_decoder.csv ================================================ Model,Instance-Type,Training Data-Type,Nodes,Topology,Microbatch,Globalbatch, Optimizer, Sequence Length, Performance [seq/sec],Strong/Weak Scaling,Neuron Version,Neuron Tutorial/Example,Pytorch Neuron(torch-neuronx) Version, OS Type. 
Llama-3.1-8B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+FP32Optimizer,32,TP=32 DP=32 PP=1 ZeRO-1,1,1024,AdamW,8192,47.95,strong scaling,2.24.0,`NeuronX Distributed `_,2.7.0.2.8.6896,U22 Llama-3.1-70B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+FP32Optimizer,32,TP=32 DP=4 PP=8,1,1024,AdamW,8192,7.94,strong scaling,2.24.0,`NeuronX Distributed `_,2.7.0.2.8.6896,U22 ================================================ FILE: about-neuron/benchmarks/trn1/training_data_encoder.csv ================================================ Model,Instance-Type,Training Data-Type,Nodes,Topology,Microbatch,Globalbatch, Optimizer, Sequence Length, Performance [seq/sec],Strong/Weak Scaling,Neuron Version,Neuron Tutorial/Example,Pytorch Neuron(torch-neuronx) Version, OS Type. HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,16,[32xNC(DP)] x 16Nodes(DP),16,1048576,Lamb,128,57407.9207,weak scaling,2.28.0,:ref:`hf-bert-pretraining-tutorial`,2.9.0.2.12.21983, U22 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,FP32,16,[32xNC(DP)] x 16Nodes(DP),8,1048576,Lamb,128,32362.6714,weak scaling,2.28.0,:ref:`hf-bert-pretraining-tutorial`,2.9.0.2.12.21983, U22 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],16,16384,AdamW,128,3826.6103,strong scaling,2.28.0,:ref:`hf-bert-pretraining-tutorial`,2.9.0.2.12.21983, U22 ================================================ FILE: about-neuron/benchmarks/trn1/training_data_vision_transformers.csv ================================================ Model,Instance-Type,Training Data-Type,Nodes,Topology,Microbatch,Globalbatch, Optimizer, Performance [seq/sec],Strong/Weak Scaling,Neuron Version,Neuron Tutorial/Example,Pytorch Neuron(torch-neuronx) Version, OS Type. HuggingFace ViT-Base fine-tuning,trn1.32xlarge/trn1n.32xlarge,BF16,1,[32xNC(DP)],64,2048,AdamW,6587.25,weak scaling,2.25.0,`ViT-Base Fine-tuning Example `_,2.7.0.2.9.0, U22 ================================================ FILE: about-neuron/benchmarks/trn1/trn1-inference-performance.rst ================================================ .. _trn1-inference-performance: Trn1/Trn1n Inference Performance ================================ .. important:: The benchmark scripts linked on this page are provided for historical reference only and are not tested with recent versions of the Neuron SDK. They have been moved to the `archive folder `_. .. contents:: Table of contents :local: *Last update: Feb 26th, 2026* .. _NLP: Encoder Models -------------- .. tab-set:: .. tab-item:: Throughput optimized .. df-table:: :header-rows: 1 df = pd.read_csv('throughput_data_encoder.csv') df_prices = pd.read_csv('trn1_instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. Type') df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format) cols_to_show = ['Model','Scripts','Framework', 'Inst. Type', 'Task', 'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'Batch Size','Sequence Length', 'Model Data Type','Compilation Autocast Data Type','OS Type'] df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']) df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float',copy=True) int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)'] df[int_cols] = df[int_cols].round(2).astype('float',copy=True) .. 
tab-item:: Latency optimized .. df-table:: :header-rows: 1 df = pd.read_csv('latency_data_encoder.csv') df_prices = pd.read_csv('trn1_instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. Type') df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (inference/sec)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format) cols_to_show = ['Model','Scripts','Framework', 'Inst. Type', 'Task', 'Throughput (inference/sec)', 'Latency P50 (ms)', 'Latency P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'Batch Size','Sequence Length', 'Model Data Type','Compilation Autocast Data Type','OS Type'] df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']) df['Throughput (inference/sec)'] = df['Throughput (inference/sec)'].round(2).astype('float',copy=True) int_cols = ['Latency P50 (ms)', 'Latency P99 (ms)'] df[int_cols] = df[int_cols].round(2).astype('float',copy=True) Encoder-Decoder Models ---------------------- .. tab-set:: .. tab-item:: Throughput optimized .. df-table:: :header-rows: 1 df = pd.read_csv('throughput_data_encoder_decoder.csv') df_prices = pd.read_csv('trn1_instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. Type') df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (tokens/second)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format) cols_to_show = ['Model','Scripts','Framework', 'Inst. Type', 'Task', 'Throughput (tokens/second)', 'Latency per Token P50 (ms)', 'Latency per Token P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'TP Degree', 'DP Degree', 'Batch Size', 'Sequence Length', 'Input Length', 'Output Length', 'Model Data Type','Compilation Autocast Data Type'] df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']) df['Throughput (tokens/second)'] = df['Throughput (tokens/second)'].round(2).astype('float',copy=True) int_cols = ['Latency per Token P50 (ms)', 'Latency per Token P99 (ms)'] df[int_cols] = df[int_cols].round(2).astype('float',copy=True) .. note:: Only for Encoder-Decoder **Throughput (tokens/second)** counts both input and output tokens **Latency per Token** counts both input and output tokens Applicable to all models **Cost per 1M inferences** is calculated using RI-Effective hourly rate. **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference. .. tab-item:: Latency optimized .. df-table:: :header-rows: 1 df = pd.read_csv('latency_data_encoder_decoder.csv') df_prices = pd.read_csv('trn1_instance_prices.csv') df = pd.merge(df,df_prices,on='Inst. Type') df['Cost per 1M inferences'] = ((1.0e6 / df['Throughput (tokens/second)']) * (df['RI-Effective hourly rate'] / 3.6e3 )).map('${:,.3f}'.format) cols_to_show = ['Model','Scripts','Framework', 'Inst. Type', 'Task', 'Throughput (tokens/second)', 'Latency per Token P50 (ms)', 'Latency per Token P99 (ms)', 'Cost per 1M inferences', 'Application Type', 'Neuron Version', 'Run Mode', 'TP Degree', 'DP Degree', 'Batch Size', 'Sequence Length', 'Input Length', 'Output Length', 'Model Data Type','Compilation Autocast Data Type'] df = df[cols_to_show].sort_values(['Model', 'Cost per 1M inferences']) df['Throughput (tokens/second)'] = df['Throughput (tokens/second)'].round(2).astype('float',copy=True) int_cols = ['Latency per Token P50 (ms)', 'Latency per Token P99 (ms)'] df[int_cols] = df[int_cols].round(2).astype('float',copy=True) .. 
note:: Only for Encoder-Decoder **Throughput (tokens/second)** counts both input and output tokens **Latency per Token** counts both input and output tokens .. note:: **Cost per 1M inferences** is calculated using RI-Effective hourly rate. **Real Time** application refers to batch size 1 inference for minimal latency. **Batch** application refers to maximum throughput with minimum cost-per-inference. ================================================ FILE: about-neuron/benchmarks/trn1/trn1-training-performance.rst ================================================ .. _trn1-training-performance: Trn1/Trn1n Training Performance =============================== This section provides benchmark results for training various deep learning models on AWS Trn1 and Trn1n instances powered by AWS Trainium chips. The benchmarks cover a range of model architectures, including encoder models, decoder models, and vision transformers, demonstrating the performance capabilities of Trn1/Trn1n instances for different training workloads. **Last update: February 19th, 2026** .. contents:: Table of contents :local: .. _NLP: Encoder Models -------------- .. csv-table:: :file: training_data_encoder.csv :header-rows: 1 Decoder Models -------------- .. csv-table:: :file: training_data_decoder.csv :header-rows: 1 .. note:: **TP (Tensor Parallel), PP (Pipeline Parallel) and DP (Data Parallel) Topology** configuration refers to the degrees of 3D Parallelism (how the model and data are sharded across NeuronCores). TP and PP are specified in the run script, and DP is calculated by dividing the **world size** (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees. For example, take ``TP = 4``, ``PP = 4``, and 32 instances (trn1.32xlarge). The world size will be ``32 (num instances) * 32 (NeuronCores per instance) = 1024``. Now, ``DP degree = 1024 (world size) / (4 (TP) * 4 (PP)) = 64``. For more information on batch sizes, please refer to :ref:`neuron-batching` Vision Transformer Models -------------------------- .. csv-table:: :file: training_data_vision_transformers.csv :header-rows: 1 .. note:: Read more about strong vs. weak scaling in :ref:`neuron-training-faq` ================================================ FILE: about-neuron/benchmarks/trn1/trn1_instance_prices.csv ================================================ Inst. Type,RI-Effective hourly rate trn1.2xlarge,0.512 trn1.32xlarge,8.197 ================================================ FILE: about-neuron/benchmarks/trn1/trn1_trn1n_nlp_data.csv ================================================ Model,Instance-Type,Training Data-Type,Nodes,Topology,Microbatch,Globalbatch, Optimizer, Performance [seq/sec],MFU[%],ComputeCostPerToken(Tflops),Strong/Weak Scaling,Neuron Version,Neuron Tutorial/Example,Pytorch Neuron(torch-neuronx) Version, OS Type.
HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,16,[32xNC(DP)] x 16Nodes(DP),16,1048576,Lamb,53069,25.83,,weak scaling,2.15.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.12.0, U20 HuggingFace BERT-Large Ph2 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,16,[32xNC(DP)] x 16Nodes(DP),2,524288,Lamb,7507,15.5,,weak scaling,2.15.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.12.0, U20 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16/AMP,16,[32xNC(DP)] x 16Nodes(DP),16,16384,AdamW,24518.47,,,strong scaling,2.14.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.11.0, U20 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,FP32,16,[32xNC(DP)] x 16Nodes(DP),8,1048576,Lamb,28432,13.83,,weak scaling,2.14.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.12.0, U20 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],16,16384,AdamW,3530,27.49,,strong scaling,2.15.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.12.0, U20 HuggingFace BERT-Large Ph1 pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],16,65536,Lamb,3733,29.07,,strong scaling,2.15.0,:ref:`hf-bert-pretraining-tutorial`,1.13.1.1.12.0, U20 GPT3-23B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,32,TP=8 DP=32 PP=4,1,1024,AdamW,100,29.65,289,strong scaling,2.15.0,`nemo-megatron `_,1.13.1.1.12.0, U20 GPT3-46B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,32,TP=8 DP=16 PP=8,1,1024,AdamW,47.2,27.7,578,strong scaling,2.15.0,`nemo-megatron `_,1.13.1.1.12.0, U20 GPT3-175B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,32,TP=32 DP=4 PP=8,1,1024,AdamW,12.7,33.14,2197,strong scaling,2.13.0,`nemo-megatron `_,1.13.1.1.10.0, U20 Llama2-7B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,32,TP=8 DP=4 PP=4,1,1024,AdamW,82,14.8,336,strong scaling,2.15.0,`nemo-megatron `_,1.13.1.1.12.0, U20 Llama2-13B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,32,TP=8 DP=4 PP=4,1,1024,AdamW,60,20.7,336,strong scaling,2.15.0,`nemo-megatron `_,1.13.1.1.12.0, U20 Llama2-7B pre-training,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+FP32Optimizer,16,TP=8 DP=64,1,1024,AdamW,81,30.8,,strong scaling,2.15.0,`neuronx-distributed `_,1.13.1.1.12.0, U20 HuggingFace ViT-Base fine-tuning,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],64,2048,AdamW,5232.78,,,weak scaling,2.17.0,`ViT-Base Fine-tuning Example `_,1.13.1.1.13.0, U20 HuggingFace CLIP-Base fine-tuning,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],80,2560,AdamW,5152.76,,,weak scaling,2.17.0,`CLIP-Base Fine-tuning `_,1.13.1.1.13.0, U20 HuggingFace Vision-Perceiver-Conv fine-tuning,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],4,128,AdamW,423.32,,,weak scaling,2.17.0,`Vision Perceiver Conv Fine-tuning `_,1.13.1.1.13.1, U20 HuggingFace Language-Perceiver fine-tuning,trn1.32xlarge/trn1n.32xlarge,Autocast:BF16+SR,1,[32xNC(DP)],20,640,AdamW,1407.02,,,weak scaling,2.17.0,`Language Perceiver Fine-tuning `_,1.13.1.1.13.1, U20 ================================================ FILE: about-neuron/beta-participation.rst ================================================ .. meta:: :description: Information about participating in the AWS Neuron SDK beta program. :date-modified: 12/19/2025 Participate in the AWS Neuron SDK Beta Program =============================================== AWS Neuron SDK users can participate in our beta program to get early access to new features and improvements.
By joining the beta program, you can provide valuable feedback that helps us enhance the AWS Neuron SDK for everyone. Currently, we are taking requests to join our Beta program for the new Neuron Kernel Interface and its associated features. If you are interested in participating, `fill out this online form `__ and we'll get back to you! Read more about the new NKI features `here `__. .. admonition:: Disclaimer Beta features are not recommended for production workloads. They may contain bugs or incomplete functionality. Use them at your own risk and provide feedback to help us improve. ================================================ FILE: about-neuron/calculator/neuron-calculator.rst ================================================ .. _neuron_calculator: Neuron Calculator ================= .. raw:: html

[Interactive calculator widget (raw HTML): "Number of NeuronCores needed for LLM Inference". Users enter model configurations; multiple values of each hyperparameter can be added by pressing Enter after each value in the text field.]
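To give a rough sense of the kind of estimate such a calculator produces, here is a minimal Python sketch that sizes an LLM by its weight and KV-cache memory and divides by a per-NeuronCore memory budget. This is an illustrative heuristic under stated assumptions, not the widget's actual formula: the function name, its defaults, and the 16 GiB-per-NeuronCore figure are all hypothetical.

.. code-block:: python

    # Illustrative sizing heuristic only; not the calculator's actual logic.
    import math

    def estimate_neuroncores(params_billion, n_layers, hidden_size,
                             batch_size=1, seq_len=2048,
                             bytes_per_value=2, gib_per_core=16):
        """Estimate NeuronCores needed to hold bf16 weights plus KV cache."""
        weights_gib = params_billion * 1e9 * bytes_per_value / 2**30
        # KV cache: K and V tensors per layer, each of shape
        # (batch_size, seq_len, hidden_size), stored at bytes_per_value.
        kv_gib = (2 * n_layers * batch_size * seq_len
                  * hidden_size * bytes_per_value) / 2**30
        return math.ceil((weights_gib + kv_gib) / gib_per_core)

    # Example: a Llama-2-7B-class config (32 layers, hidden size 4096)
    print(estimate_neuroncores(7, n_layers=32, hidden_size=4096))  # -> 1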
.. raw:: html ================================================ FILE: about-neuron/faq/contributing-faq.rst ================================================ .. _contribute-faq: Contributing Guidelines FAQs ============================ .. contents:: Table of contents :local: :depth: 1 Whether it's a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community. Please read through this document before submitting any issues or pull requests to ensure we have all the necessary information to effectively respond to your bug report or contribution. How to report Bugs/Feature Requests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We welcome you to use the GitHub issue tracker to report bugs or suggest features. When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: - A reproducible test case or series of steps - The version of our code being used - Any modifications you've made relevant to the bug - Anything unusual about your environment or deployment Contributing via Pull Requests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 1. You are working against the latest source on the *master* branch. 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. To send us a pull request, please: 1. Fork the repository. 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 3. Ensure local tests pass. 4. Commit to your fork using clear commit messages. 5. Send us a pull request, answering any default questions in the pull request interface. 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. GitHub provides additional documentation on `forking a repository `__ and `creating a pull request `__. How to find contributions to work on ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. What is the code of conduct ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This project has adopted the `Amazon Open Source Code of Conduct `__. For more information, see the `Code of Conduct FAQ `__ or contact opensource-codeofconduct@amazon.com with any additional questions or comments. How to report a security issue ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our `vulnerability reporting page `__. Please do **not** create a public GitHub issue. What is the licensing ~~~~~~~~~~~~~~~~~~~~~~~~ See the `link `_ and `link `_ files for our project's licensing. We will ask you to confirm the licensing of your contribution. We may ask you to sign a `Contributor License Agreement (CLA) `__ for larger changes.
================================================ FILE: about-neuron/faq/index.rst ================================================ .. _neuron_faq: Other Neuron FAQs ================= Frequently asked questions about the AWS Neuron SDK, covering general topics, inference, training, ONNX support, and contributing guidelines. .. note:: This content may not be up to date as of 2026, and often pertains to older or now-unsupported platforms and components. General FAQs ------------- .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: neuron2-intro-faq :link-type: doc :class-card: sd-border-1 **Neuron 2.x Introduction FAQ** ^^^ Common questions about Neuron 2.x and Trn1 general availability .. grid-item-card:: :link: onnx-faq :link-type: doc :class-card: sd-border-1 **ONNX FAQ** ^^^ Using ONNX models with AWS Neuron .. grid-item-card:: :link: contributing-faq :link-type: doc :class-card: sd-border-1 **Contributing Guidelines FAQ** ^^^ How to report bugs, request features, and contribute to Neuron Inference FAQs --------------- .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: inference/neuron-faq :link-type: doc :class-card: sd-border-1 **Inference with Neuron FAQ** ^^^ Common questions about running inference workloads on AWS Neuron .. grid-item-card:: :link: inference/trouble-shooting-faq :link-type: doc :class-card: sd-border-1 **Troubleshooting for Inf1 FAQ** ^^^ Debugging and troubleshooting inference issues on Inf1 instances Training FAQs ------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: training/neuron-training :link-type: doc :class-card: sd-border-1 **Training with Neuron FAQ** ^^^ Common questions about training models on Trainium instances .. toctree:: :maxdepth: 1 :hidden: Neuron 2.x Introduction FAQ ONNX FAQ Contributing Guidelines FAQ Inference with Neuron FAQ Troubleshooting for Inf1 FAQ Training with Neuron FAQ ================================================ FILE: about-neuron/faq/inference/neuron-faq.rst ================================================ .. _neuron-f1-faq: Inference with Neuron - FAQ --------------------------- .. contents:: Table of contents :local: :depth: 1 What ML model types and operators are supported by AWS Neuron? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AWS Neuron includes a compiler that converts your trained machine learning models to a binary object for execution. The Neuron compiler supports many commonly used machine learning operators used in computer vision, natural language processing, recommender engines, and more. A list of supported ML operators and supported inputs is in :ref:`neuron-supported-operators`. It's important to mention that good performance doesn't require all of the model operators to run on the chip. In many cases, some of the operators will continue to run on the instance CPUs, as in the case of embeddings or image pre-processing, and will still provide compelling end-to-end performance. We call this approach auto-partitioning, where the Neuron compiler optimizes the model execution based on operators that are most suitable to run on the CPU or the chip. For the latest model architecture support, please refer to the model architecture fit and performance pages. Why is a compiler needed, and how do I use it?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Neuron compiler converts a model from a framework-level Neural Network graph, with operators like convolution and pooling, into a Neuron Device-specific instruction set, builds the schedule for execution of these instructions, and converts the model parameters into a format that the Neuron device can consume. The supported input formats include TensorFlow, PyTorch, and MXNet. The output from the compiler is a Neuron Executable File Format (NEFF) artifact. The NEFF contains a combination of binary code, the model parameters, and additional meta-data needed by the Neuron runtime and profiler. I am using an ML framework today – what will change for me to use this? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To use Inferentia within the Inf1 instances, the developer needs to perform one-time compilation of the pre-trained model to generate a NEFF, and use this as the inference model in a fleet of Inf1 instances. - :doc:`TensorFlow Neuron ` - :ref:`neuron-pytorch` - :ref:`neuron-mxnet` What is a NeuronCore Pipeline? How do I take advantage of it? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A NeuronCore Pipeline is a unique technique to shard a specific Neural Network across multiple NeuronCores, to take advantage of the large on-chip cache instead of moving data in and out of external memory. The result is increased throughput and reduced latency, which is typically important for real-time inference applications. All Inf1 instances support it, and Inf1 instances with multiple Inferentia accelerators, such as inf1.6xlarge or inf1.24xlarge, support it across chips thanks to the fast chip-to-chip interconnect. Developers can choose to use NeuronCore Pipeline mode during the compile stage, with an opt-in flag. :ref:`neuron-cc` provides further details. NeuronCores, NeuronCore Groups and NeuronCore Pipelines: What do they do? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Each Inferentia chip has four compute engines called NeuronCores. A NeuronCore Group is a way to aggregate NeuronCores to increase hardware utilization and assign models with the right compute sizing for a specific application. If you want to run multiple models in parallel, you can assign different models to separate NeuronCore Groups. A model compiled to use multiple NeuronCores in a NeuronCore Pipeline can be assigned to a NeuronCore Group with enough NeuronCores to load it into. Finally, it is also possible for sets of Inferentia devices to be mapped to separate Neuron Runtimes. The :ref:`neuron-features-index` section has more information and examples. Can I use TensorFlow networks from tfhub.dev as-is? If not, what should I do? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Yes. Such models can be imported into TensorFlow, either through a standard model server, in which case serving appears as a simple command line utility, or via the Python-based TensorFlow environment. The primary additional step needed is to compile the model into the Inferentia NEFF format. ================================================ FILE: about-neuron/faq/inference/trouble-shooting-faq.rst ================================================ .. _trouble-shooting-inf1-faq: Troubleshooting for Inf1 - FAQ ============================== .. contents:: Table of contents :local: :depth: 1 Performance is not what I expect it to be, what's the next step?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Please check our performance optimization section for notes on performance tuning and on how to use pipelining and batching to improve performance. Do I need to worry about the size of my model and the size of Inferentia memory? What problems can I expect to have? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Errors like these will be logged and can be found as shown in :ref:`neuron_gatherinfo`. How can I debug / profile my inference request? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ See :ref:`neuron-plugin-tensorboard`. How to report Bugs/Feature Requests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We welcome you to use the Neuron GitHub issue tracker to report bugs or suggest features. When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: - A reproducible test case or series of steps - The version of our code being used - Any modifications you've made relevant to the bug - Anything unusual about your environment or deployment ================================================ FILE: about-neuron/faq/neuron2-intro-faq.rst ================================================ .. _neuron2-intro-faq: Neuron 2.x Introduction at Trn1 GA - FAQ ---------------------------------------- .. contents:: Table of contents :local: :depth: 1 .. include:: /release-notes/templates/n2.x-trn1-ga-faq.txt ================================================ FILE: about-neuron/faq/onnx-faq.rst ================================================ .. _onnx-faq: ONNX FAQ --------- .. contents:: Table of contents :local: :depth: 1 Can I use ONNX models with Neuron? If not, what should I do? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AWS Neuron does not directly support compilation of models in the ONNX file format. The recommended way to compile a model that is in the ONNX file format is to first convert the model to PyTorch using a publicly available tool like `onnx2pytorch `_. Once the ONNX model is converted to PyTorch, it can then be compiled with the :func:`torch_neuron.trace` function to produce a model that can run on Neuron. ================================================ FILE: about-neuron/faq/roadmap-faq.rst ================================================ .. _neuron_roadmap_faq: Roadmap FAQ =========== .. contents:: Table of contents :local: :depth: 1 Why did you build this? ~~~~~~~~~~~~~~~~~~~~~~~ A: We know that our customers are making decisions and plans based on what we are developing, and we want to provide them with the right visibility into what we are working on, as well as the opportunity to provide direct feedback. What do the roadmap categories mean? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Roadmap Requests** - Requests we have received and are considering adding to the roadmap. This is a great phase to give us feedback and let us know if you need a feature as well. - **Working on it** - In progress; we might still be working through the implementation details or scoping things out. This is a great phase to give us feedback as to how you want to see something implemented. We’ll benefit from your specific use cases here. - **Completed** - Feature complete and supported by Neuron. Why are there no dates on your roadmap?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: We are not providing exact target dates for releases because we prioritize operational excellence, security, and quality over hitting a specific date. If you have an urgent need for a feature, please contact us directly at aws-neuron-support@amazon.com. Is everything on the roadmap? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: We are focusing on upgrades for existing features, as well as building new features. We will keep adding features and capabilities to this roadmap as time progresses. How can I provide feedback or ask for more information? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: When in doubt, please create an issue or post a question on the `AWS Neuron support forum `__. How can I request a feature be added to the roadmap? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: We encourage you to open an issue. All community-submitted issues will be reviewed by the roadmap maintainers. Can I "+1" existing issues? ~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: We strongly encourage you to do so, as it helps us understand which issues will have the widest impact. You can navigate to the issue details page and add a reaction (thumbs up). Several types of reactions are supported (thumbs down "-1", confused, heart, watching, laugh, hooray, and thumbs up "+1"). ================================================ FILE: about-neuron/faq/training/neuron-training.rst ================================================ .. _neuron-training-faq: Training with Neuron - FAQ ========================== .. contents:: Table of contents :local: :depth: 2 Compute ------- How do I get started with training my model on Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you select your machine learning framework, you can get started here: :ref:`docs-quick-links` How do I set up EFA for multi-node training? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To set up the EFA that is needed for multi-node training, please see :ref:`setup-trn1-multi-node-execution` How do I know if I can train my models with Trainium? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We aim to support a broad set of models and distribution libraries. We continuously add more capabilities and enable new features via Neuron SDK releases, and we suggest you follow our public roadmap and join our Slack and email lists. How should I size Trainium NeuronCores vs GPUs? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For simplicity, you should consider each NeuronCore within your instances as an independent deep learning compute engine, the equivalent of a GPU. As a point of comparison, a trn1.32xlarge has 32 NeuronCores, and their max performance is 40% higher than that of P4d for BF16/FP16/FP8, 2.5X faster for TF32, and 5X faster for FP32. Each NeuronCore is independent and connected to the rest of the NeuronCores within the instance via NeuronLink, and across instances with EFA. Each NeuronCore also has full access to the accelerator memory in the instance, which helps scale large models across NeuronCores using various collective compute ops techniques. What are the time-to-train advantages of Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While the answer is largely model dependent, training performance on Trn1 is fast thanks to multiple system-wide optimizations working in concert. Depending on the data type, you should expect between 1.4X and 5X higher throughput on Trn1 as compared to the latest GPU instances (P4d).
For distributed workloads, 800 Gbps EFA gives customers lower latency and 2x the throughput as compared to P4d (a Trn1n 1.6 Tbps option is coming soon). Each Trainium also has a dedicated collective compute (CC) engine, which enables running the CC ops in parallel to the NeuronCores compute. This enables another 10-15% acceleration of the overall workload. Finally, stochastic rounding enables running at half precision speeds (BF16) while maintaining accuracy at near full precision. This not only simplifies model development (no need for mixed precision), it also helps the loss function converge faster and reduces the memory footprint. What are some of the training performance results for Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ They are great! Please refer to the :ref:`benchmark` page for open-source model performance results. We encourage you to try it for your own models/applications. Can I use CUDA libraries with AWS Trainium? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AWS Trainium and Neuron plug into popular frameworks and automatically optimize model deployment on Neuron devices like Inferentia and Trainium. The Neuron SDK automatically optimizes for Trainium without using closed-source dependencies like NVIDIA CUDA and without requiring any application-level code changes to accelerate models. We believe this intentional approach allows developers freedom of choice with their code and models. If your applications have dependencies on CUDA (or other third-party closed-source artifacts), you will need to strip them out; from that point, the Neuron compiler will take the model as is and optimize it at the hardware level. Networking ---------- What’s important to know about the networking in Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Trn1 instances have the fastest EFA in AWS; clocked at 800 Gbps, they enable more collective communication than other training instances, which is important if your training job spans multiple servers. You should also expect lower latency, as we streamline the communication path between the dedicated collective communication engine on Trainium and the AWS Nitro EFA NICs. How does Trainium accelerate collective communication operations? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Trainium introduces a dedicated collective compute engine that runs in parallel to the compute cores (aka NeuronCores). This improves the convergence time of intermediate steps, as the communication happens in parallel to the compute. This capability, in addition to the faster and optimized EFA, results in better scalability and faster time to train, as compared to other training instances in AWS. What does Strong/Weak Scaling mean? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To enable strong scaling, we optimized Trainium to be efficient at small batch sizes. Compared to GPUs, Trn1 maintains high efficiency even for small batch sizes. This allows you to scale out to thousands of devices without increasing the global mini-batch size at the same rate, which in turn leads to faster end-to-end training convergence. In the weak scaling setup, we show the optimal throughput with a sufficiently large batch size per Trainium. The large batch size is set to leverage the high core utilization so that the overall end-to-end training will be fast. This setup also enables a large global batch size as it scales with the total number of nodes in the cluster. Usability --------- What has AWS done to improve usability of Trainium?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Stochastic rounding enables running at half precision speeds (BF16) while maintaining accuracy at near full precision. This of course helps the loss function converge faster and reduces the memory footprint, but equally important, it simplifies model development: you can write your model in FP32, and Neuron/Trainium will auto-cast the model to BF16 and execute it with SR enabled. There is no need to lose accuracy with pure BF16 runs and, more importantly, no need to experiment with mixed-precision strategies to find the optimal settings. Eager debug mode provides a convenient utility to step through the code and evaluate operator correctness as part of your model creation/debugging. For more details, please refer to the Neuron documentation. What other AWS services work with Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Trn1 via its Neuron SDK supports Amazon ECS, EKS, ParallelCluster, Batch, and Amazon SageMaker. Customers can also choose to run in a Neuron container within their self-managed container orchestration service (e.g., Kubernetes and Ray). What tools are available to develop models with Trn1? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When running training, evaluation, or inference workloads, you can use Neuron 2.x CLI tools such as neuron-ls and neuron-top to get insights into NeuronCore and NeuronDevice performance and memory utilization, topology, and host vCPU performance and memory utilization. In addition, the Neuron Plugin for TensorBoard provides a standard GUI that enables profiling and debugging of models. TensorBoard views include: - Model overview: provides a summary of the model and the utilization on the Host and NeuronDevice - Operators’ view: provides a breakdown of ML framework and HLO operators on both Host and NeuronDevice - Code trace view: shows a timeline of the model execution at the framework and HLO operators level - Hardware trace view: shows a timeline of the model execution at the level of hardware (Host, NeuronDevice, Data Transfer) - Topology view: shows the NeuronDevices topology within an instance How will compile time impact my workflow? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We understand compilation is a new step with Trainium, but as long as the overall time to train and cost to train are optimized, the impact of compilation on these two metrics is minimized. To further help reduce compilation time impact on usability, Neuron supports a persistent cache, where artifacts that have not changed since the last run can be reused, skipping compilation altogether. For developing and experimenting with new models, you can use the eager debug mode, which compiles (and caches) op-by-op, enabling quick evaluation without compiling large models. We are also working on a Neuron model analyzer (see the Neuron roadmap) that will recommend optimized hyperparameters, skipping full compilation per experiment. ================================================ FILE: about-neuron/faq.rst ================================================ .. _neuron_faq: .. meta:: :description: Frequently Asked Questions (FAQ) about the AWS Neuron SDK, including topics on Neuron 2.x, training, inference, runtime, compiler, containers, and ONNX support. :date-modified: 2025-10-03 Neuron FAQ ========== This topic provides links to frequently asked questions (FAQs) about the AWS Neuron SDK, organized by Neuron component. Neuron 2.x FAQ -------------- .. grid:: 1 :gutter: 2 ..
grid-item-card:: :link: neuron2-intro-faq :link-type: ref **Neuron 2.x Introduction FAQ** ^^^ Common questions about Neuron 2.x features and migration Training-specific FAQ --------------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: neuron-training-faq :link-type: ref **Neuron Training FAQ** ^^^ Frequently asked questions about training models on Neuron Inference-specific FAQ ---------------------- .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: neuron-f1-faq :link-type: ref **Inference with Neuron FAQ** ^^^ Questions about Inf1 instance inference capabilities .. grid-item-card:: :link: trouble-shooting-inf1-faq :link-type: ref **Inf1 Troubleshooting FAQ** ^^^ Common ``Inf1`` instance issues and solutions .. grid-item-card:: :link: neuronperf_faq :link-type: ref **NeuronPerf FAQ** ^^^ Performance benchmarking tool questions Neuron Runtime FAQ ------------------ .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: neuron-runtime-faq :link-type: ref **Neuron Runtime FAQ** ^^^ Runtime configuration and execution questions Neuron Compiler FAQ ------------------- .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: neuronx_compiler_faq :link-type: ref **NeuronX Compiler FAQ** ^^^ Questions about the NeuronX compiler for Trn1/Inf2 .. grid-item-card:: :link: neuron_compiler_faq :link-type: ref **Neuron Compiler FAQ** ^^^ Questions about the Neuron compiler for Inf1 Neuron DLCs FAQ --------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: container-faq :link-type: ref **Neuron Containers FAQ** ^^^ Container deployment and configuration questions Support ------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: contribute-faq :link-type: ref **Contribute FAQ** ^^^ Questions about contributing to the Neuron project ================================================ FILE: about-neuron/index.rst ================================================ .. _about-neuron: About the AWS Neuron SDK ======================== AWS Neuron is a software development kit (SDK) enabling high-performance deep learning acceleration using AWS Inferentia and Trainium, AWS's custom-designed machine learning accelerators. It enables you to develop, profile, and deploy high-performance machine learning workloads on AWS Inferentia and Trainium instances. The AWS Neuron SDK includes: * **Neuron Compiler** - Compiles high-level, framework-based models for optimal performance on Neuron devices * **Neuron Kernel Interface (NKI)** - Provides direct compiler access to Neuron device capabilities * **Neuron Runtime** - Executes compiled models on Neuron devices * **ML Framework integration** - Deep support for PyTorch and JAX * **Training and inference libraries** - Distributed training and inference libraries for large-scale models * **Deployment support** - Integration with AWS services like SageMaker, EC2, EKS, and ECS * **Developer tools** - Profiling, monitoring, and debugging utilities For a full list of AWS Neuron features, see :ref:`what-is-neuron`. .. admonition:: Join our Beta program Get early access to new Neuron features and tools! `Fill out this form and apply to join our Beta program `__. What is "NeuronX"? ------------------ "NeuronX" refers to the next-generation AWS Neuron SDK, which provides enhanced capabilities for both inference and training on AWS Inferentia and Trainium instances.
NeuronX includes: * Support for the latest versions of PyTorch and JAX * Advanced compiler optimizations for improved performance * Enhanced distributed training libraries for large-scale models * Improved profiling and debugging tools * Ongoing feature development and support for new instance types Catch up on the latest Neuron news ----------------------------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: /about-neuron/whats-new :link-type: doc :class-card: sd-border-1 **What's New in Neuron** ^^^ Read about the latest releases and features of the Neuron SDK Learn about AWS Neuron ---------------------- .. grid:: 1 :gutter: 2 .. grid-item-card:: :link: /about-neuron/what-is-neuron :link-type: doc :class-card: sd-border-1 **What is AWS Neuron?** ^^^ Short overview of the AWS Neuron SDK and its components .. grid:: 1 1 2 2 :gutter: 2 .. grid-item-card:: :link: /about-neuron/arch/index :link-type: doc :class-card: sd-border-1 **Neuron architecture** ^^^ Understand the Neuron hardware and software architecture .. grid-item-card:: :link: /about-neuron/arch/neuron-features/index :link-type: doc :class-card: sd-border-1 **Neuron features** ^^^ Overviews of model development features provided by Neuron .. grid-item-card:: :link: /frameworks/index :link-type: doc :class-card: sd-border-1 **Supported ML frameworks** ^^^ Neuron support for popular ML frameworks including PyTorch and JAX .. grid-item-card:: :link: /libraries/index :link-type: doc :class-card: sd-border-1 **NeuronX distributed (NxD) libraries** ^^^ NeuronX distributed libraries for training and inference .. grid-item-card:: :link: /nki/index :link-type: doc :class-card: sd-border-1 **Neuron Kernel Interface (NKI)** ^^^ NKI is a low-level interface for custom, bare-metal kernel development .. grid-item-card:: :link: /compiler/index :link-type: doc :class-card: sd-border-1 **Neuron Compiler** ^^^ The Neuron compiler optimizes models for Neuron hardware .. grid-item-card:: :link: /neuron-runtime/index :link-type: doc :class-card: sd-border-1 **Neuron Runtime** ^^^ Runtime for executing compiled models on Neuron devices .. grid-item-card:: :link: /tools/index :link-type: doc :class-card: sd-border-1 **Neuron developer tools** ^^^ Tools for profiling, debugging, and monitoring Neuron applications .. grid-item-card:: :link: /dlami/index :link-type: doc :class-card: sd-border-1 **AWS Neuron Deep Learning AMIs** ^^^ Deploy the Neuron SDK on EC2 instances with pre-installed Amazon Machine Images (AMIs) .. grid-item-card:: :link: /containers/index :link-type: doc :class-card: sd-border-1 **AWS Neuron Deep Learning Containers** ^^^ Deploy the Neuron SDK using pre-built Docker deep learning containers (DLCs) Resources --------- * :ref:`Setup Guide ` * :ref:`Release Notes ` * :ref:`Neuron FAQ ` * :doc:`Older Neuron FAQs ` Support ------- * :doc:`Neuron Open Source GitHub Repos ` * :ref:`AWS Neuron SDK maintenance policy ` .. _contact-us: Contact us ---------- For support, submit a request with AWS Neuron `GitHub issues `_ or visit the `Neuron AWS forums `_ for an answer. If you want to request a feature or report a critical issue, you can contact us directly at ``aws-neuron-support@amazon.com``. .. toctree:: :maxdepth: 1 :hidden: App Notes Ask Amazon AI helper tools Benchmarks Beta Participation Model Samples Neuron FAQ Neuron Features Open Source SDK Maintenance Policy Security Term Glossary Troubleshooting What is AWS Neuron?
Older Neuron FAQs ================================================ FILE: about-neuron/models/index.rst ================================================ .. _model_samples_tutorials: Model samples and tutorials =========================== .. toctree:: :maxdepth: 1 :hidden: Training on Trn1 Inference on Inf2/Trn1/Trn2 Inference on Inf1 This section gives you the consolidated list of code samples and tutorials published by AWS Neuron across documentation and various GitHub repositories. .. card:: Training on Trn1 :link: model_samples_training_trn1 :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Inference on Inf2, Trn1 and Trn2 :link: model_samples_inference_inf2_trn1 :link-type: ref :class-body: sphinx-design-class-title-small .. card:: Inference on Inf1 :link: model_samples_inference_inf1 :link-type: ref :class-body: sphinx-design-class-title-small For links to individual GitHub sample repositories, see :ref:`neuron-github-samples`. ================================================ FILE: about-neuron/models/inference-inf1-samples.rst ================================================ .. _model_samples_inference_inf1: Inference Samples/Tutorials (Inf1) ================================== .. important:: The samples linked on this page have been archived and are provided for historical reference only. They are not tested with recent versions of the Neuron SDK. .. contents:: Table of contents :local: :depth: 1 .. _encoder_model_samples_inference_inf1: Encoders -------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - bert-base-cased-finetuned-mrpc - torch-neuron - * HuggingFace pretrained BERT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * `BertBaseCased Inference on Inf1 instances `_ * Bert TorchServe tutorial :ref:`[html] ` * Bring your own HuggingFace pretrained BERT container to SageMaker Tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - bert-base-uncased - torch-neuron - * NeuronCore Pipeline tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - bert-large-uncased - torch-neuron - * `BertLargeUncased Inference on Inf1 instances `_ * - roberta-base - torch-neuron - * `Roberta-Base inference on Inf1 instances `_ * - distilbert-base-uncased-finetuned-sst-2-english - tensorflow-neuron - * Tensorflow 2.x - HuggingFace Pipelines distilBERT with Tensorflow2 Neuron :ref:`[html] ` :github:`[notebook] ` * - gluon bert - mxnet-neuron - * MXNet 1.8: Using data parallel mode tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] ` .. _vision_transformer_model_samples_inference_inf1: Vision Transformers ------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - ssd - torch-neuron - * `Inference of SSD model on inf1 instances `_ * - TrOCR - torch-neuron - * `TrOCR inference on Inf1 instances `_ * - vgg - torch-neuron - * `VGG inference on Inf1 instances `_ * - google/vit-base-patch16-224 - torch-neuron - * `ViT model inference on Inf1 `_ .. _cnn_model_samples_inference_inf1: Convolutional Neural Networks (CNN) ----------------------------------- ..
list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - EfficientNet - torch-neuron - * `EfficientNet model inference on Inf1 instances `_ * - GFL (MMDetection) - torch-neuron - * `GFL (MMDetection) inference on Inf1 instances `_ * - HRNet - torch-neuron - * `HRNET - Pose Estimation `_ * - MarianMT - torch-neuron - * HuggingFace MarianMT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * `Inference of Pre-trained MarianMT model on Inf1 `_ * - Detectron2 R-CNN - torch-neuron - * `R-CNN inference on Inf1 `_ * - resnet - torch-neuron - * `Inference of Pre-trained Resnet model (18,34,50,101,152) on Inf1 `_ * ResNet-50 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - resnet - tensorflow-neuron - * Tensorflow 2.x - Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving :ref:`[html] ` * - resnet - mxnet-neuron - * ResNet-50 tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] ` * Getting started with Gluon tutorial :ref:`[html] ` :github:`[notebook] ` * NeuronCore Groups tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] ` * - Resnext - torch-neuron - * `Inference of Resnext model on Inf1 `_ * - Yolov4 - torch-neuron - * PyTorch YOLOv4 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - Yolov5 - torch-neuron - * `Inference of Yolov5 on Inf1 `_ * - Yolov6 - torch-neuron - * `Inference of Yolov6 on Inf1 instances `_ * - Yolov7 - torch-neuron - * `Inference of Yolov7 model on Inf1 `_ * - Yolof - torch-neuron - * `Inference of Yolof model on Inf1 `_ * - fairseq - torch-neuron - * `Inference of fairseq model on Inf1 `_ * - unet - tensorflow-neuron - * `Unet - Tensorflow 2.x tutorial `_ .. _vision_model_samples_inference_inf1: Vision ------ .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - craft-pytorch - torch-neuron - * `CRAFT model inference on Inf1 `_ ================================================ FILE: about-neuron/models/inference-inf2-trn1-samples.rst ================================================ .. _model_samples_inference_inf2_trn1: Inference Samples/Tutorials (Inf2/Trn1/Trn2) ============================================ .. important:: Some samples linked on this page have been archived and are provided for historical reference only. They are not tested with recent versions of the Neuron SDK. For the latest inference tutorials, refer to :ref:`NxD Inference Tutorials `. .. contents:: Table of contents :local: :depth: 1 .. _encoder_model_samples_inference_inf2_trn1: Encoders -------- .. 
list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - bert-base-cased-finetuned-mrpc - torch-neuronx - * :ref:`BERT TorchServe tutorial ` * HuggingFace pretrained BERT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * `LibTorch C++ Tutorial for HuggingFace Pretrained BERT `_ * `Compiling and Deploying HuggingFace Pretrained BERT on Inf2 on Amazon SageMaker `_ * - bert-base-cased-finetuned-mrpc - neuronx-distributed - * :ref:`tp_inference_tutorial` * - bert-base-uncased - torch-neuronx - * `HuggingFace Pretrained BERT Inference on Trn1 `_ * - distilbert-base-uncased - torch-neuronx - * `HuggingFace Pretrained DistilBERT Inference on Trn1 `_ * - roberta-base - tensorflow-neuronx - * HuggingFace Roberta-Base :ref:`[html]` :github:`[notebook] ` * - roberta-large - torch-neuronx - * `HuggingFace Pretrained RoBERTa Inference on Trn1 `_ .. _decoder_model_samples_inference_inf2_trn1: Decoders -------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - gpt2 - torch-neuronx - * `HuggingFace Pretrained GPT2 Feature Extraction on Trn1 `_ * - meta-llama/Llama-3.3-70B - neuronx-distributed-inference - * :ref:`nxdi-trn2-llama3.3-70b-tutorial` * :ref:`/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-dp-tutorial.ipynb` * :ref:`nxdi-sd-inference-tutorial` * - meta-llama/Llama-3.1-8b - transformers-neuronx - * `Run Hugging Face Llama 3.1 8B autoregressive sampling on Inf2 & Trn1 with 32k sequence length `_ * `Run Hugging Face Llama 3.1 8B autoregressive sampling on Inf2 & Trn1 with 128k sequence length `_ * `Run meta-llama/Meta-Llama-3.1-8B autoregressive sampling on Inf2 & Trn1 `_ * - meta-llama/Llama-3.1-70b - transformers-neuronx - * `Run Hugging Face Llama 3.1 70B autoregressive sampling on Trn1 with 64k sequence length `_ * `Run Hugging Face meta-llama/Meta-Llama-3.1-70B autoregressive sampling on Inf2 & Trn1 `_ * - meta-llama/Llama-3.1-70b-Instruct - transformers-neuronx - * `Run Hugging Face Llama-3.1-70B-Instruct + Llama-3.2-1B-Instruct Speculative Decoding on Trn1 with transformers-neuronx and vLLM `_ * `Run Hugging Face Llama-3.1-70B-Instruct EAGLE Speculative Decoding on Trn1 with transformers-neuronx and vLLM `_ * - meta-llama/Llama-3.1-405b - neuronx-distributed-inference - * :ref:`Tutorial for deploying Llama-3.1-405B on Trn2 ` * :ref:`nxdi-trn2-llama3.1-405b-speculative-tutorial` * - meta-llama/Llama-3.1-405b - transformers-neuronx - * `Run Hugging Face Llama 3.1 405B autoregressive sampling on Trn1/Trn1n with 16k sequence length `_ * - meta-llama/Llama-3-8b - transformers-neuronx - * `Run Hugging Face meta-llama/Llama-3-8b autoregressive sampling on Inf2 & Trn1 `_ * - meta-llama/Llama-3-70b - transformers-neuronx - * `Run Hugging Face meta-llama/Llama-3-70b autoregressive sampling on Inf2 & Trn1 `_ * - meta-llama/Llama-2-13b - transformers-neuronx - * `Run Hugging Face meta-llama/Llama-2-13b autoregressive sampling on Inf2 & Trn1 `_ * - meta-llama/Llama-2-70b - transformers-neuronx - * `Run Hugging Face meta-llama/Llama-2-70b autoregressive sampling on Inf2 & Trn1 `_ * `Run speculative sampling on Meta Llama models [Beta] `_ * - meta-llama/Llama-3.2-1B-Instruct - neuronx-distributed - * `Run meta-llama/Llama-3.2-1B-Instruct on Inf2 and Trn1 `_ * - meta-llama/codellama-13b - neuronx-distributed - * `Run meta-llama/codellama-13b-16k-sampling `_ * - 
mistralai/Mistral-7B-Instruct-v0.1 - transformers-neuronx - * :ref:`Run Mistral-7B-Instruct-v0.1 autoregressive sampling on Inf2 & Trn1 ` * - mistralai/Mistral-7B-Instruct-v0.2 - transformers-neuronx - * `Run Hugging Face mistralai/Mistral-7B-Instruct-v0.2 autoregressive sampling on Inf2 & Trn1 [Beta] `_ * - Mixtral-8x7B-v0.1 - transformers-neuronx - * `Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 `_ * - Mixtral-8x7B - neuronx-distributed - * `Mixtral inference with NeuronX Distributed on Inf2 & Trn1 `_ * - DBRX - neuronx-distributed - * `DBRX inference with NeuronX Distributed on Inf2 & Trn1 `_ * - codellama/CodeLlama-13b-hf - transformers-neuronx - * `Run Hugging Face codellama/CodeLlama-13b-hf autoregressive sampling on Inf2 & Trn1 `_ .. _encoder_decoder_model_samples_inference_inf2_trn1: Encoder-Decoders ---------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - t5-large - * torch-neuronx * optimum-neuron - * T5 inference tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - t5-3b - neuronx-distributed - * T5 inference tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - google/flan-t5-xl - neuronx-distributed - * flan-t5-xl inference tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` .. _vision_transformer_model_samples_inference_inf2_trn1: Vision Transformers ------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - google/vit-base-patch16-224 - torch-neuronx - * `HuggingFace Pretrained ViT Inference on Trn1 `_ * - clip-vit-base-patch32 - torch-neuronx - * `HuggingFace Pretrained CLIP Base Inference on Inf2 `_ * - clip-vit-large-patch14 - torch-neuronx - * `HuggingFace Pretrained CLIP Large Inference on Inf2 `_ .. _cnn_model_samples_inference_inf2_trn1: Convolutional Neural Networks (CNN) ----------------------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - resnet50 - torch-neuronx - * `Torchvision Pretrained ResNet50 Inference on Trn1 / Inf2 `_ * Torchvision ResNet50 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * - resnet50 - tensorflow-neuronx - * :ref:`tensorflow-servingx-neuronrt-visible-cores` * - unet - torch-neuronx - * `Pretrained UNet Inference on Trn1 / Inf2 `_ * - vgg - torch-neuronx - * `Torchvision Pretrained VGG Inference on Trn1 / Inf2 `_ .. _sd_model_samples_inference_inf2_trn1: Stable Diffusion ---------------- ..
list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - stable-diffusion-v1-5 - torch-neuronx - * `HuggingFace Stable Diffusion 1.5 (512x512) Inference on Trn1 / Inf2 `_ * - stable-diffusion-2-1-base - torch-neuronx - * `HuggingFace Stable Diffusion 2.1 (512x512) Inference on Trn1 / Inf2 `_ * - stable-diffusion-2-1 - torch-neuronx - * `HuggingFace Stable Diffusion 2.1 (768x768) Inference on Trn1 / Inf2 `_ * `Deploy & Run Stable Diffusion on SageMaker and Inferentia2 `_ * - stable-diffusion-xl-base-1.0 - torch-neuronx - * `HuggingFace Stable Diffusion XL 1.0 (1024x1024) Inference on Inf2 `_ * `HuggingFace Stable Diffusion XL 1.0 Base and Refiner (1024x1024) Inference on Inf2 `_ * - stable-diffusion-2-inpainting - torch-neuronx - * `stable-diffusion-2-inpainting model Inference on Trn1 / Inf2 `_ .. _diffusion_transformers_samples_inference_inf2_trn1: Diffusion Transformers ---------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - pixart-alpha - torch-neuronx - * `HuggingFace PixArt Alpha (256x256, 512x512 square resolution) Inference on Trn1 / Inf2 `_ * - pixart-sigma - torch-neuronx - * `HuggingFace PixArt Sigma (256x256, 512x512 square resolution) Inference on Trn1 / Inf2 `_ .. _audio_model_samples_inference_inf2_trn1: Audio ----- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - wav2vec2-conformer - torch-neuronx - * `Run HuggingFace Pretrained Wav2Vec2-Conformer with Rotary Position Embeddings Inference on Inf2 `_ * `Run HuggingFace Pretrained Wav2Vec2-Conformer with Relative Position Embeddings Inference on Inf2 & Trn1 `_ .. _multi_modal_model_samples_inference_inf2_trn1: Multi Modal ----------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - multimodal-perceiver - torch-neuronx - * `HuggingFace Multimodal Perceiver Inference on Trn1 / Inf2 `_ * - language-perceiver - torch-neuronx - * `HF Pretrained Perceiver Language Inference on Trn1 / Inf2 `_ * - vision-perceiver-conv - torch-neuronx - * `HF Pretrained Perceiver Image Classification Inference on Trn1 / Inf2 `_ ================================================ FILE: about-neuron/models/training-trn1-samples.rst ================================================ .. _model_samples_training_trn1: Training Samples/Tutorials (Trn1/Trn1n) ======================================= .. contents:: Table of contents :local: :depth: 1 .. _encoder_model_samples_training_trn1: Encoders -------- .. 
list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - bert-base-cased - torch-neuronx - * `Fine-tune a "bert-base-cased" PyTorch model for Text Classification `_ * `How to fine-tune a "bert base cased" PyTorch model with AWS Trainium (Trn1 instances) for Sentiment Analysis `_ * - bert-base-uncased - torch-neuronx - * `Fine-tune a "bert-base-uncased" PyTorch model `_ * `Fine-tuning BERT base model from HuggingFace on Amazon SageMaker `_ * - bert-large-cased - torch-neuronx - * `Fine-tune a "bert-large-cased" PyTorch model `_ * - bert-large-uncased - torch-neuronx - * :ref:`hf-bert-pretraining-tutorial` * `Launch Bert Large Phase 1 pretraining job on Parallel Cluster `_ * `Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS `_ * :ref:`torch-hf-bert-finetune` * `Fine-tune a "bert-large-uncased" PyTorch model `_ * - roberta-base - torch-neuronx - * `Fine-tune a "roberta-base" PyTorch model `_ * - roberta-large - torch-neuronx - * `Fine-tune a "roberta-large" PyTorch model `_ * - xlm-roberta-base - torch-neuronx - * `Fine-tune a "xlm-roberta-base" PyTorch model `_ * - albert-base-v2 - torch-neuronx - * `Fine-tune an "albert-base-v2" PyTorch model `_ * - distilbert-base-uncased - torch-neuronx - * `Fine-tune a "distilbert-base-uncased" PyTorch model `_ * - camembert-base - torch-neuronx - * `Fine-tune a "camembert-base" PyTorch model `_ * - cl-tohoku/bert-base-japanese-whole-word-masking - torch-neuronx - * `Fine-tuning & Deployment Hugging Face BERT Japanese model `_ .. _decoder_model_samples_training_trn1: Decoders -------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - gpt-2 - torch-neuronx - * `How to run training jobs for "gpt2" PyTorch model with AWS Trainium `_ * :ref:`zero1-gpt2-pretraining-tutorial` * - gpt-3 - neuronx-nemo-megatron - * `Launch a GPT-3 23B pretraining job using neuronx-nemo-megatron `_ * `Launch a GPT-3 46B pretraining job using neuronx-nemo-megatron `_ * `Launch a GPT-3 175B pretraining job using neuronx-nemo-megatron `_ * - GPT-NEOX-20B - neuronx-distributed - * :ref:`gpt_neox_20b_tp_zero1_tutorial` * `Training GPT-NEOX 20B model using neuronx-distributed `_ * `Pre-train GPT Neox 20b on Wikicorpus dataset using Neuronx Distributed library `_ * - GPT-NEOX-6.9B - neuronx-distributed - * :ref:`gpt_neox_tp_zero1_tutorial` * `Training GPT-NEOX 6.9B model using neuronx-distributed `_ * `Pre-train GPT Neox 6.9b on Wikicorpus dataset using Neuronx Distributed library `_ * - meta-llama/Llama-3.1-70b - neuronx-distributed - * :ref:`llama2_tp_pp_tutorial` * - meta-llama/Llama-3.1-8b - neuronx-distributed - * :ref:`llama2_7b_tp_zero1_tutorial` * - meta-llama/Llama-3-70b - neuronx-distributed - * :ref:`llama2_tp_pp_tutorial` * - meta-llama/Llama-3-8b - nxd-training - * :ref:`hf_llama3_8B_pretraining` * :ref:`hf_llama3_8B_SFT` * - meta-llama/Llama-3-8b - neuronx-distributed - * :ref:`Training Llama3 8B Model with Tensor Parallelism and ZeRO-1 Optimizer ` * :ref:`Tutorial for Fine-tuning Llama3 8B with tensor parallelism and LoRA using Neuron PyTorch-Lightning with NeuronX Distributed ` * - meta-llama/Llama-2-7b - neuronx-distributed - * :ref:`llama2_7b_tp_zero1_tutorial` * `Training Llama2 7B Model with AWS Batch and Trainium `_ * :ref:`llama2_7b_tp_zero1_ptl_finetune_tutorial` * `Pre-train Llama2-7B on Wikicorpus dataset using Neuronx
Distributed library `_ * - meta-llama/Llama-2-13b - neuronx-distributed - * :ref:`llama2_tp_pp_tutorial` * - meta-llama/Llama-2-70b - neuronx-distributed - * :ref:`llama2_tp_pp_tutorial` * - codegen25-7b-mono - neuronx-distributed - * :ref:`codegen25_7b_tp_zero1_tutorial` * - meta-llama/Llama-2 - neuronx-nemo-megatron - * `Launch a Llama-2-7B pretraining job using neuronx-nemo-megatron `_ * `Launch a Llama-2-13B pretraining job using neuronx-nemo-megatron `_ * `Launch a Llama-2-70B pretraining job using neuronx-nemo-megatron `_ * - Mistral-7B - neuronx-nemo-megatron - * `Training Mistral-7B `_ .. _encoder_decoder_model_samples_training_trn1: Encoder-Decoders ---------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - t5-small - * torch-neuronx * optimum-neuron - * :ref:`torch-hf-t5-finetune` * - facebook/bart-large - * torch-neuronx - * `How to fine-tune a "Bart-Large" PyTorch model with AWS Trainium (trn1 instances) `_ .. _vision_transformer_model_samples_training_trn1: Vision Transformers ------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - google/vit-base-patch16-224-in21k - torch-neuronx - * `Fine-tune a pretrained HuggingFace vision transformer PyTorch model `_ * - openai/clip-vit-base-patch32 - torch-neuronx - * `Fine-tune a pretrained HuggingFace CLIP-base PyTorch model with AWS Trainium `_ * - openai/clip-vit-large-patch14 - torch-neuronx - * `Fine-tune a pretrained HuggingFace CLIP-large PyTorch model with AWS Trainium `_ .. _sd_model_samples_training_trn1: Stable Diffusion ---------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - stabilityai/stable-diffusion-2-1-base - torch-neuronx - * [Beta] `Train stabilityai/stable-diffusion-2-1-base with AWS Trainium (trn1 instances) `_ * - runwayml/stable-diffusion-v1-5 - torch-neuronx - * [Beta] `Train runwayml/stable-diffusion-v1-5 with AWS Trainium (trn1 instances) `_ .. _multi_modal_model_samples_training_trn1: Multi Modal ----------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - language-perceiver - torch-neuronx - * `How to fine-tune a "language perceiver" PyTorch model with AWS Trainium (trn1 instances) `_ * - vision-perceiver-conv - torch-neuronx - * `How to fine-tune a pretrained HuggingFace Vision Perceiver Conv `_ .. _cnn_model_samples_training_trn1: Convolutional Neural Networks (CNN) ----------------------------------- .. list-table:: :widths: 20 15 45 :header-rows: 1 :align: left :class: table-smaller-font-size * - Model - Frameworks/Libraries - Samples and Tutorials * - resnet50 - torch-neuronx - * `How to fine-tune a pretrained ResNet50 PyTorch model with AWS Trainium (trn1 instances) using NeuronSDK `_ ================================================ FILE: about-neuron/monitoring-tools.rst ================================================ .. _monitoring_tools: Monitoring Tools ================= ..
toctree:: :maxdepth: 1 Neuron-Monitor User Guide Neuron-Top User Guide Neuron-LS User Guide Neuron-Sysfs User Guide NCCOM-TEST User Guide What's New ================================================ FILE: about-neuron/news-and-blogs/CONTRIBUTING.md ================================================ # Contributing to AWS Neuron News and Blogs Thank you for your interest in sharing content about AWS Neuron, Trainium, and Inferentia! This page collects external articles, blog posts, tutorials, and news to help the community discover valuable content. ## How to Add Your Article ### Quick Steps 1. **Fork the repository** on GitHub 2. **Edit the data file**: `about-neuron/news-and-blogs/news-and-blogs.yaml` 3. **Add your article** following the format below 4. **Submit a pull request** with your changes ### Article Entry Format Add your article to the appropriate section in `news-and-blogs.yaml`: ```yaml - title: "Your Article Title" url: "https://example.com/your-article" description: "A brief 1-2 sentence description of your article content." author: "Your Name or Organization" author_url: "https://your-website.com" # Optional for featured articles date: "YYYY-MM-DD" # Publication date category: "blog" # Options: blog, news, tutorial, case-study, benchmark locale: "en-US" # Language/region code (e.g., en-US, ja-JP, zh-CN, de-DE, fr-FR) featured: false # Set to true only if approved by AWS Neuron team icon: "📝" # Optional emoji icon for featured articles ``` ### Sections - **`featured_articles`**: Highlighted content (requires AWS Neuron team approval) - **`all_articles`**: All community and official content ### Categories Choose the most appropriate category for your content: - **`blog`**: Technical blog posts and articles - **`news`**: News announcements and press releases - **`tutorial`**: Step-by-step guides and how-tos - **`case-study`**: Customer success stories and use cases - **`benchmark`**: Performance benchmarks and comparisons ### Locale Codes Specify the language and region of your article using standard locale codes: **Common Locales:** - `en-US` - English (United States) 🇺🇸 - `en-GB` - English (United Kingdom) 🇬🇧 - `ja-JP` - Japanese 🇯🇵 - `zh-CN` - Chinese (Simplified) 🇨🇳 - `zh-TW` - Chinese (Traditional) 🇹🇼 - `ko-KR` - Korean 🇰🇷 - `de-DE` - German 🇩🇪 - `fr-FR` - French 🇫🇷 - `es-ES` - Spanish (Spain) 🇪🇸 - `es-MX` - Spanish (Mexico) 🇲🇽 - `pt-BR` - Portuguese (Brazil) 🇧🇷 - `it-IT` - Italian 🇮🇹 - `nl-NL` - Dutch 🇳🇱 - `ru-RU` - Russian 🇷🇺 - `ar-SA` - Arabic 🇸🇦 - `hi-IN` - Hindi 🇮🇳 A flag emoji will be automatically displayed next to your article based on the locale. If your locale isn't in the list, a 🌐 globe icon will be shown. ### Example Entry ```yaml all_articles: - title: "Building Large Language Models on AWS Trainium" url: "https://example.com/llm-trainium-guide" description: "A comprehensive guide to training and deploying LLMs using AWS Trainium instances with practical code examples." author: "Jane Developer" date: "2026-01-15" category: "tutorial" locale: "en-US" featured: false ``` ### Guidelines 1. **Content must be relevant** to AWS Neuron, Trainium, or Inferentia 2. **Provide accurate information** - ensure URLs work and descriptions are clear 3. **Use proper formatting** - follow YAML syntax exactly 4. **One article per pull request** - makes review easier 5. **Include context** in your PR description about why this content is valuable ### Featured Articles To request your article be featured: 1. Add it to `all_articles` first with `featured: false` 2. 
In your pull request, explain why it should be featured 3. AWS Neuron team will review and may promote it to `featured_articles` Featured articles should be: - High-quality, in-depth content - Particularly valuable to the community - Recent (typically within the last 6 months) ### Review Process 1. Submit your pull request 2. AWS Neuron team will review within 5-7 business days 3. May request changes or clarifications 4. Once approved, your article will appear on the next documentation build ### Questions? - Open an issue in the repository - Contact your AWS Neuron support representative - Email: aws-neuron-support@amazon.com ## Content Guidelines ### What to Include ✅ Technical tutorials and guides ✅ Performance benchmarks and analysis ✅ Customer success stories ✅ Integration guides with other tools ✅ Best practices and optimization tips ✅ Conference talks and presentations ✅ Research papers using Neuron/Trainium/Inferentia ### What Not to Include ❌ Marketing content without technical substance ❌ Broken or paywalled links ❌ Content unrelated to AWS Neuron ecosystem ❌ Duplicate submissions ❌ Self-promotional content without value to community ## Technical Details This page uses: - **Sphinx** with `sphinxcontrib.datatemplates` extension - **YAML** for data storage - **Jinja2** templates for rendering - **sphinx-design** for grid layouts The system is fully static - no backend required. All content is rendered at build time. ## License By contributing, you agree that your contributions will be licensed under the same license as this project. See the repository LICENSE files for details. ================================================ FILE: about-neuron/news-and-blogs/JIRA-INTEGRATION-DESIGN.md ================================================ # Jira Integration Design for News & Blogs ## Overview This document describes a design for populating the `news-and-blogs.yaml` file from Jira tickets, allowing contributors to submit article links via Jira instead of direct pull requests. ## Design Goals 1. **Simple for contributors**: Submit a Jira ticket with article metadata 2. **Automated**: Minimal manual intervention to add articles to YAML 3. **Quality control**: Review process before articles appear on the site 4. **Compatible**: Works with existing Sphinx build process 5. **No backend required**: Leverages existing CI/CD infrastructure ## Architecture ### Option 1: GitHub Actions + Jira API (Recommended) ``` Jira Ticket Created → GitHub Action Triggered → Parse Ticket → Update YAML → Create PR ``` **Components:** 1. **Jira Ticket Template**: Custom issue type "News Article Submission" 2. **GitHub Action**: Runs on schedule (e.g., hourly) or webhook 3. **Python Script**: Fetches approved tickets, generates YAML entries 4. 
**Automated PR**: Creates pull request with new articles **Workflow:** ```yaml # .github/workflows/sync-jira-articles.yml name: Sync Jira Articles to YAML on: schedule: - cron: '0 */6 * * *' # Every 6 hours workflow_dispatch: # Manual trigger jobs: sync-articles: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: python-version: '3.9' - name: Install dependencies run: pip install jira pyyaml - name: Fetch and process Jira tickets env: JIRA_URL: ${{ secrets.JIRA_URL }} JIRA_USER: ${{ secrets.JIRA_USER }} JIRA_TOKEN: ${{ secrets.JIRA_TOKEN }} run: python scripts/sync_jira_articles.py - name: Create Pull Request uses: peter-evans/create-pull-request@v5 with: commit-message: 'Add articles from Jira' title: 'Add news articles from Jira submissions' body: 'Automated PR from Jira article submissions' branch: jira-articles-sync ``` ### Option 2: Jira Automation + Webhook ``` Jira Ticket Approved → Webhook to GitHub → GitHub Action → Update YAML → Create PR ``` **Advantages:** - Real-time updates when tickets are approved - No polling required - More efficient **Setup:** 1. Configure Jira Automation rule 2. Trigger on status change to "Approved" 3. Send webhook to GitHub repository dispatch endpoint ### Option 3: Manual Script (Simplest) ``` Developer runs script → Fetches approved tickets → Updates YAML → Commits changes ``` **Use case:** Lower volume, manual review preferred ## Jira Ticket Structure ### Custom Fields Required ``` Issue Type: News Article Submission Fields: - Article Title (text, required) - Article URL (URL, required) - Description (text area, required) - Author Name (text, required) - Author URL (URL, optional) - Publication Date (date, required) - Category (dropdown: blog|news|tutorial|case-study|benchmark) - Locale (dropdown: en-US|ja-JP|zh-CN|ko-KR|de-DE|fr-FR|es-ES|pt-BR) - Keywords (labels or multi-select) - Featured (checkbox) - Icon (text, optional, for featured articles) Status Workflow: - Submitted → Under Review → Approved → Published → Rejected ``` ### Example Jira Ticket ``` Title: Add Karakuri AWS Trainium Tutorial Fields: - Article Title: AWS Trainium: 50 Exercises - Article URL: https://zenn.dev/karakuri_blog/articles/5ccedeee1beb08 - Description: Learn how to build LLMs for Trainium accelerators... 
- Author Name: Karakuri - Author URL: https://about.karakuri.ai/ - Publication Date: 2026-02-19 - Category: tutorial - Locale: ja-JP - Keywords: trainium, llm, training, tutorial - Featured: Yes - Icon: 🚀 ``` ## Implementation Script ### `scripts/sync_jira_articles.py` ```python #!/usr/bin/env python3 """ Sync approved Jira article submissions to news-and-blogs.yaml """ import os import yaml from jira import JIRA from datetime import datetime # Configuration JIRA_URL = os.environ.get('JIRA_URL') JIRA_USER = os.environ.get('JIRA_USER') JIRA_TOKEN = os.environ.get('JIRA_TOKEN') YAML_FILE = 'about-neuron/news-and-blogs/news-and-blogs.yaml' # JQL to find approved, unpublished articles JQL_QUERY = 'project = NEURON AND issuetype = "News Article Submission" AND status = "Approved" AND labels != "published"' def connect_jira(): """Connect to Jira instance""" return JIRA(server=JIRA_URL, basic_auth=(JIRA_USER, JIRA_TOKEN)) def fetch_approved_articles(jira): """Fetch approved article submissions from Jira""" issues = jira.search_issues(JQL_QUERY, maxResults=100) articles = [] for issue in issues: article = { 'title': issue.fields.customfield_10001, # Article Title 'url': issue.fields.customfield_10002, # Article URL 'description': issue.fields.customfield_10003, # Description 'author': issue.fields.customfield_10004, # Author Name 'date': issue.fields.customfield_10006, # Publication Date 'category': issue.fields.customfield_10007.value, # Category 'locale': issue.fields.customfield_10008.value, # Locale 'keywords': [label for label in issue.fields.labels if label != 'published'], # Jira labels are plain strings 'featured': bool(issue.fields.customfield_10009), # Featured checkbox } # Optional fields if issue.fields.customfield_10005: # Author URL article['author_url'] = issue.fields.customfield_10005 if issue.fields.customfield_10010: # Icon (for featured) article['icon'] = issue.fields.customfield_10010 articles.append({ 'article': article, 'issue_key': issue.key }) return articles def load_yaml(): """Load existing YAML file""" with open(YAML_FILE, 'r', encoding='utf-8') as f: return yaml.safe_load(f) def article_exists(data, url): """Check if article URL already exists in YAML""" all_urls = [a['url'] for a in data.get('featured_articles', [])] all_urls.extend([a['url'] for a in data.get('all_articles', [])]) return url in all_urls def add_articles_to_yaml(data, new_articles): """Add new articles to appropriate sections""" added_keys = [] for item in new_articles: article = item['article'] # Skip if already exists if article_exists(data, article['url']): print(f"Skipping duplicate: {article['title']}") continue # Add to appropriate section if article.get('featured', False): data['featured_articles'].append(article) else: data['all_articles'].append(article) added_keys.append(item['issue_key']) print(f"Added: {article['title']} ({item['issue_key']})") return added_keys def save_yaml(data): """Save updated YAML file""" with open(YAML_FILE, 'w', encoding='utf-8') as f: yaml.dump(data, f, allow_unicode=True, sort_keys=False, default_flow_style=False) def mark_as_published(jira, issue_keys): """Add 'published' label to Jira tickets and transition to Published status""" for key in issue_keys: issue = jira.issue(key) # Add published label labels = issue.fields.labels if 'published' not in labels: labels.append('published') issue.update(fields={'labels': labels}) # Transition to Published status (adjust transition ID as needed) try: jira.transition_issue(issue, 'Published') except Exception as e: print(f"Could not transition
{key}: {e}") def main(): print("Connecting to Jira...") jira = connect_jira() print("Fetching approved articles...") new_articles = fetch_approved_articles(jira) if not new_articles: print("No new articles to add.") return print(f"Found {len(new_articles)} approved articles") print("Loading existing YAML...") data = load_yaml() print("Adding articles to YAML...") added_keys = add_articles_to_yaml(data, new_articles) if added_keys: print("Saving YAML...") save_yaml(data) print("Marking Jira tickets as published...") mark_as_published(jira, added_keys) print(f"Successfully added {len(added_keys)} articles!") else: print("No new articles added (all were duplicates)") if __name__ == '__main__': main() ``` ## Setup Instructions ### 1. Configure Jira 1. Create custom issue type "News Article Submission" 2. Add custom fields (see structure above) 3. Configure workflow: Submitted → Under Review → Approved → Published 4. Create Jira API token for automation user ### 2. Configure GitHub Secrets Add these secrets to your GitHub repository: ``` JIRA_URL: https://your-company.atlassian.net JIRA_USER: automation@your-company.com JIRA_TOKEN: ``` ### 3. Add GitHub Action Create `.github/workflows/sync-jira-articles.yml` with the workflow above. ### 4. Install Dependencies Add to `requirements.txt`: ``` jira==3.5.0 PyYAML==6.0 ``` ### 5. Test 1. Create a test Jira ticket 2. Approve it 3. Run workflow manually: Actions → Sync Jira Articles → Run workflow 4. Verify PR is created with new article ## Alternative: Simpler Webhook Approach If you want something lighter without Jira API polling: ### Jira Automation Rule ``` Trigger: Issue transitioned to "Approved" Condition: Issue type = "News Article Submission" Action: Send web request URL: https://api.github.com/repos/aws-neuron/aws-neuron-sdk/dispatches Method: POST Headers: Authorization: Bearer ${GITHUB_TOKEN} Accept: application/vnd.github.v3+json Body: { "event_type": "jira-article-approved", "client_payload": { "issue_key": "{{issue.key}}", "title": "{{issue.customfield_10001}}", "url": "{{issue.customfield_10002}}", "description": "{{issue.customfield_10003}}", "author": "{{issue.customfield_10004}}", "date": "{{issue.customfield_10006}}", "category": "{{issue.customfield_10007}}", "locale": "{{issue.customfield_10008}}" } } ``` Then GitHub Action receives webhook and processes directly without Jira API calls. ## Maintenance ### Regular Tasks 1. **Monitor failed syncs**: Check GitHub Action logs 2. **Review PRs**: Automated PRs should still be reviewed before merge 3. **Clean up Jira**: Archive old Published tickets 4. **Update mappings**: If custom field IDs change, update script ### Troubleshooting **Articles not syncing:** - Check Jira API credentials - Verify custom field IDs match - Check JQL query returns expected tickets **Duplicate articles:** - Script checks URL before adding - Manually remove duplicates from YAML if needed **Formatting issues:** - Validate YAML after sync: `python -m yaml about-neuron/news-and-blogs/news-and-blogs.yaml` - Check for special characters in descriptions ## Security Considerations 1. **API Tokens**: Store in GitHub Secrets, never commit 2. **Permissions**: Use dedicated Jira service account with minimal permissions 3. **Validation**: Sanitize all input from Jira before adding to YAML 4. 
**Review**: Always review automated PRs before merging ## Cost & Complexity | Approach | Setup Time | Maintenance | Cost | |----------|-----------|-------------|------| | GitHub Actions + Jira API | 4-6 hours | Low | Free (GitHub Actions) | | Webhook + GitHub Actions | 2-3 hours | Very Low | Free | | Manual Script | 1-2 hours | Medium | Free | ## Recommendation **For production use**: Start with **Option 3 (Manual Script)** to validate the workflow, then upgrade to **Option 1 (GitHub Actions)** once the process is proven and volume increases. **For high volume**: Use **Option 2 (Webhook)** for real-time updates. ## Future Enhancements 1. **Validation**: Add URL validation, duplicate detection in Jira 2. **Preview**: Generate preview of how article will appear 3. **Scheduling**: Support future publication dates 4. **Analytics**: Track article submissions and approval rates 5. **Notifications**: Notify submitters when articles are published 6. **Bulk import**: Support CSV upload for multiple articles ================================================ FILE: about-neuron/news-and-blogs/README.md ================================================ # AWS Neuron News and Blogs System This directory contains a dynamic, community-driven news and blogs page for AWS Neuron, Trainium, and Inferentia content. ## Overview The system allows external contributors to add links to relevant articles, blog posts, and news through a simple YAML data file, without requiring any backend infrastructure. ## Architecture ``` about-neuron/news-and-blogs/ ├── index.rst # Main page (uses datatemplate directives) ├── news-and-blogs.yaml # Data file with all article metadata ├── featured-articles.tmpl # Jinja2 template for featured section ├── all-articles.tmpl # Jinja2 template for all articles section ├── CONTRIBUTING.md # Contribution guidelines └── README.md # This file ``` ## How It Works 1. **Data Storage**: Article metadata is stored in `news-and-blogs.yaml` 2. **Templating**: Jinja2 templates (`*.tmpl`) define how articles are rendered 3. **Rendering**: Sphinx's `datatemplates` extension processes the YAML and templates at build time 4. **Output**: Static HTML with grid cards using `sphinx-design` ## Key Features - ✅ **No backend required** - fully static site generation - ✅ **Easy contributions** - edit a YAML file and submit a PR - ✅ **Version controlled** - all changes tracked in Git - ✅ **Automated rendering** - Sphinx handles everything at build time - ✅ **Responsive design** - uses sphinx-design grid system - ✅ **Maintainable** - clear separation of data, templates, and content ## Adding New Articles See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed instructions. Quick example: ```yaml all_articles: - title: "My Article Title" url: "https://example.com/article" description: "Brief description" author: "Author Name" date: "2026-01-15" category: "blog" featured: false ``` ## Modifying Templates Templates use Jinja2 syntax and have access to the YAML data structure. ### Featured Articles Template (`featured-articles.tmpl`) Renders articles from the `featured_articles` section with: - Large cards with borders - Icons and bold titles - Author attribution with links - Publication dates ### All Articles Template (`all-articles.tmpl`) Renders articles from the `all_articles` section with: - 2-column grid on desktop, 1-column on mobile - Simple card layout - Title and description ## Customization ### Adding New Fields 1. Add field to YAML entries: ```yaml - title: "Article" new_field: "value" ``` 2. 
Update template to use it: ```jinja {{ article.new_field }} ``` ### Changing Layout Edit the grid directive in templates: ```rst .. grid:: 1 1 2 3 # 1 col mobile, 1 tablet, 2 desktop, 3 wide :gutter: 2 ``` ### Adding Filters/Sorting You can add Jinja2 filters in templates: ```jinja {% for article in all_articles | sort(attribute='date', reverse=True) %} {# Sorted by date, newest first #} {% endfor %} ``` ## Dependencies Required Sphinx extensions (already in `conf.py`): - `sphinxcontrib.datatemplates` - YAML data processing - `sphinx_design` - Grid card layouts ## Testing Locally 1. Install dependencies: ```bash pip install -r requirements.txt ``` 2. Build documentation: ```bash sphinx-build -b html . _build/html ``` 3. View the page: ```bash open _build/html/about-neuron/news-and-blogs/index.html ``` ## Troubleshooting ### Template Not Found Error Ensure templates are in the same directory as `index.rst` or add the directory to `templates_path` in `conf.py`. ### YAML Parse Error Validate your YAML: ```bash python -c "import yaml; yaml.safe_load(open('news-and-blogs.yaml'))" ``` ### Articles Not Rendering Check that: 1. YAML file is in the same directory as `index.rst` 2. Template files exist and have correct names 3. YAML structure matches template expectations ## Future Enhancements Possible improvements: - Add category filtering/grouping - Add search functionality - Add RSS feed generation - Add automatic link checking - Add article metadata validation - Sort by date automatically - Add pagination for large lists ## Support For questions or issues: - Open a GitHub issue - Contact AWS Neuron support team - See main repository CONTRIBUTING.md ================================================ FILE: about-neuron/news-and-blogs/article-template.yaml ================================================ # Article Entry Template # # Copy this template and fill in your article details. # Then add it to the appropriate section in news-and-blogs.yaml # # For featured articles (requires AWS Neuron team approval): # Add to the 'featured_articles' section with featured: true and an icon # # For regular articles: # Add to the 'all_articles' section with featured: false # TEMPLATE - Copy everything below this line # ============================================ - title: "Your Article Title Here" url: "https://your-website.com/path-to-article" description: "A clear, concise description of your article in 1-2 sentences. Explain what readers will learn or discover." 
author: "Your Name or Organization Name" author_url: "https://your-website.com" # Optional: Your website or profile URL (for featured articles) date: "YYYY-MM-DD" # Publication date in YYYY-MM-DD format (e.g., 2026-01-27) category: "blog" # Choose ONE: blog, news, tutorial, case-study, benchmark locale: "en-US" # Language/region code (e.g., en-US, ja-JP, zh-CN, de-DE, fr-FR, es-ES, pt-BR, ko-KR) keywords: ["keyword1", "keyword2", "keyword3"] # List of 3-10 relevant keywords for filtering/search featured: false # Set to false unless approved by AWS Neuron team icon: "📝" # Optional: Single emoji for featured articles (e.g., 🚀 📊 🎯 💡 ⚡) # ============================================ # FIELD DESCRIPTIONS # ============================================ # # title (required): # - Clear, descriptive title of your article # - Keep under 100 characters # - Use title case # # url (required): # - Full HTTPS URL to your article # - Must be publicly accessible # - Should not require login or paywall # # description (required): # - Brief summary of article content # - 20-500 characters recommended # - Focus on what readers will learn # - Avoid marketing language # # author (required): # - Your name or organization # - Will be displayed as attribution # # author_url (optional): # - Link to your website or profile # - Only used in featured articles # - Must be valid HTTPS URL # # date (required): # - Publication date in YYYY-MM-DD format # - Use the date the article was published # - Not the date you're adding it here # # category (required): # - blog: Technical blog posts and articles # - news: News announcements and press releases # - tutorial: Step-by-step guides and how-tos # - case-study: Customer success stories and use cases # - benchmark: Performance benchmarks and comparisons # # locale (required): # - Language and region code in format: language-REGION # - Examples: en-US (English-US), ja-JP (Japanese), zh-CN (Chinese-Simplified) # - Common codes: en-US, en-GB, ja-JP, zh-CN, zh-TW, ko-KR, de-DE, fr-FR, # es-ES, es-MX, pt-BR, pt-PT, it-IT, nl-NL, ru-RU, ar-SA, hi-IN # - A flag emoji will be displayed based on the locale # - Unknown locales will display 🌐 globe icon # # keywords (required): # - List of 3-10 relevant keywords for filtering and search # - Use lowercase, hyphenated format (e.g., "machine-learning", "pytorch") # - Include technology names, topics, and key concepts # - Examples: ["trainium", "inference", "pytorch", "llm", "optimization"] # - Keywords help users find your article through filtering and search # # featured (required): # - true: Article appears in featured section (requires approval) # - false: Article appears in all articles section # - Most submissions should use false # # icon (optional): # - Single emoji character # - Only used for featured articles # - Examples: 🚀 📊 🎯 💡 ⚡ 🔥 ✨ 🌟 📈 🛠️ # # ============================================ # EXAMPLES # ============================================ # Example 1: Tutorial Article - title: "Getting Started with PyTorch on AWS Trainium" url: "https://example.com/pytorch-trainium-tutorial" description: "A comprehensive guide to training PyTorch models on AWS Trainium instances, including setup, optimization tips, and common pitfalls to avoid." 
author: "Jane Developer" date: "2026-01-15" category: "tutorial" locale: "en-US" keywords: ["pytorch", "trainium", "training", "tutorial", "getting-started"] featured: false # Example 2: Benchmark Article - title: "BERT Inference Performance: Inferentia2 vs GPU Comparison" url: "https://example.com/bert-benchmark" description: "Detailed performance comparison of BERT inference on AWS Inferentia2 vs leading GPU instances, including cost analysis and throughput metrics." author: "ML Performance Lab" date: "2026-01-20" category: "benchmark" locale: "en-US" keywords: ["inferentia", "bert", "benchmark", "performance", "gpu-comparison"] featured: false # Example 3: Case Study - title: "How Acme Corp Reduced ML Training Costs by 60% with Trainium" url: "https://example.com/acme-case-study" description: "Learn how Acme Corp migrated their large language model training to AWS Trainium and achieved significant cost savings while maintaining performance." author: "Acme Corp Engineering Team" date: "2026-01-25" category: "case-study" locale: "en-US" keywords: ["trainium", "cost-optimization", "llm", "case-study", "migration"] featured: false # Example 4: Featured Article (requires approval) - title: "Advanced Optimization Techniques for Neuron Compiler" url: "https://example.com/neuron-optimization" description: "Deep dive into advanced compiler optimization techniques for AWS Neuron, with practical examples and performance improvements." author: "AWS Neuron Team" author_url: "https://aws.amazon.com/machine-learning/neuron/" date: "2026-01-27" category: "blog" locale: "en-US" keywords: ["neuron", "compiler", "optimization", "performance", "advanced"] featured: true icon: "⚡" # Example 5: Japanese Article - title: "AWS Trainiumで大規模言語モデルを訓練する" url: "https://example.jp/trainium-llm-training" description: "AWS Trainiumを使用して大規模言語モデルを効率的に訓練する方法を詳しく解説します。" author: "日本のMLエンジニア" date: "2026-01-20" category: "tutorial" locale: "ja-JP" keywords: ["trainium", "llm", "training", "japanese", "tutorial"] featured: false ================================================ FILE: about-neuron/news-and-blogs/index.rst ================================================ .. meta:: :description: Links to external news and blog articles about AWS Neuron and Trainium/Inferentia ML accelerators. :date-modified: 02/26/2026 .. _neuron-news: AWS Neuron News and Blogs ========================= Stay up to date with the latest news, announcements, and technical blog posts about AWS Neuron, AWS Trainium, and AWS Inferentia. Discover customer success stories, performance benchmarks, best practices, and deep dives into machine learning acceleration on AWS. ---- Featured Articles ----------------- Read recent blogs and technical content about Neuron, Trainium, and Inferentia from AWS subject matter experts and our highly experienced customers. .. datatemplate:yaml:: news-and-blogs.yaml .. grid:: 1 :gutter: 2 {% for article in data.featured_articles %} {% if article.locale == 'en-US' %}{% set flag = '🇺🇸' %}{% set locale_name = 'English' %}{% elif article.locale == 'ja-JP' %}{% set flag = '🇯🇵' %}{% set locale_name = 'Japanese' %}{% elif article.locale == 'zh-CN' %}{% set flag = '🇨🇳' %}{% set locale_name = 'Chinese' %}{% elif article.locale == 'ko-KR' %}{% set flag = '🇰🇷' %}{% set locale_name = 'Korean' %}{% else %}{% set flag = '🌐' %}{% set locale_name = 'Unknown' %}{% endif %} .. 
grid-item-card:: :class-card: sd-border-2 :link: {{ article.url }} {{ article.icon }} **{{ article.title }}** ^^^ {{ article.description }} +++ **Published on**: {{ article.date }} | {{ flag }} ({{ locale_name }}) | Content by `{{ article.author }} <{{ article.author_url }}>`__ {% endfor %} .. note:: This page is regularly updated with new content. Bookmark it to stay informed about the latest developments in AWS Neuron, Trainium, and Inferentia. For the full list of featured articles and posts, go to the :ref:`News & Blogs ` section of this page. .. _all-articles: News & Blogs ------------- Explore the latest news, press releases, and industry coverage about AWS Neuron, Trainium, and Inferentia.
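The cards in this section are generated at build time from ``news-and-blogs.yaml``. Each entry in that file follows the format below (a minimal sketch with placeholder values; see the contribution guide in the repository for the full field reference):

.. code-block:: yaml

   all_articles:
     - title: "Your Article Title"
       url: "https://example.com/your-article"
       description: "One or two sentences on what readers will learn."
       author: "Your Name or Organization"
       date: "2026-01-15"
       category: "blog"    # blog | news | tutorial | case-study | benchmark
       locale: "en-US"
       keywords: ["trainium", "inference"]
       featured: false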
.. datatemplate:yaml:: news-and-blogs.yaml .. grid:: 1 1 2 2 :gutter: 2 :class-container: articles-grid news-blogs-grid {% for article in data.all_articles|sort(attribute='date', reverse=True) %} {% if article.locale == 'en-US' %}{% set flag = '🇺🇸' %}{% set locale_name = 'English' %}{% elif article.locale == 'ja-JP' %}{% set flag = '🇯🇵' %}{% set locale_name = 'Japanese' %}{% elif article.locale == 'zh-CN' %}{% set flag = '🇨🇳' %}{% set locale_name = 'Chinese' %}{% elif article.locale == 'ko-KR' %}{% set flag = '🇰🇷' %}{% set locale_name = 'Korean' %}{% else %}{% set flag = '🌐' %}{% set locale_name = 'Unknown' %}{% endif %} .. grid-item-card:: :link: {{ article.url }} :class-card: sd-border-1 article-card :class-body: article-locale-{{ article.locale }} **{{ article.title }}** ^^^ {{ article.description }} +++ **Published on**: {{ article.date }} | {{ flag }} ({{ locale_name }}) {% endfor %} .. important:: AWS and Neuron provide links to external articles and posts to help you discover them, but do not commission or own any content not created by AWS employees. This list is curated based on internal and customer recommendations. **Want to add your article?** Go to `https://github.com/aws-neuron/aws-neuron-sdk `_, edit ``about-neuron/news-and-blogs/news-and-blogs.yaml`` to add your submission, and submit a pull request. ================================================ FILE: about-neuron/news-and-blogs/news-and-blogs.yaml ================================================ # AWS Neuron News and Blogs Data File # # This file contains metadata for external articles, blog posts, and news about # AWS Neuron, Trainium, and Inferentia. # # To contribute a new article: # 1. Add a new entry to the appropriate section below # 2. Follow the existing format exactly # 3. Submit a pull request with your changes # # Entry format: # - title: "Article Title" # url: "https://example.com/article" # description: "Brief description of the article content" # author: "Author Name or Organization" # date: "YYYY-MM-DD" # category: "blog|news|tutorial|case-study|benchmark" # locale: "en-US" # Language/region code (e.g., en-US, ja-JP, zh-CN, de-DE, fr-FR, es-ES, pt-BR, ko-KR) # keywords: ["keyword1", "keyword2", "keyword3"] # List of relevant keywords for filtering # featured: true|false # Set to true for featured articles section featured_articles: - title: "AWS Trainium: 50 Exercises" url: "https://zenn.dev/karakuri_blog/articles/5ccedeee1beb08" description: "Learn how to build LLMs for Trainium accelerators with this rich 50-lesson guide from customer Karakuri." author: "Karakuri" author_url: "https://about.karakuri.ai/" date: "2026-02-19" category: "tutorial" locale: "en-US" keywords: ["trainium", "llm", "training", "tutorial", "japanese"] featured: true icon: "🚀" - title: "Cost-effective AI image generation with PixArt-Sigma inference on AWS Trainium and AWS Inferentia" url: "https://aws.amazon.com/blogs/machine-learning/cost-effective-ai-image-generation-with-pixart-sigma-inference-on-aws-trainium-and-aws-inferentia/" description: "Learn how to use AWS Trainium and Inferentia to deploy a PixArt-Sigma diffusion transformer model."
author: "AWS Neuron Team" author_url: "https://aws.amazon.com/machine-learning/neuron/" date: "2026-02-19" category: "blog" locale: "en-US" keywords: ["inferentia", "trainium", "inference", "diffusion", "image-generation"] featured: true icon: "📊" all_articles: # Japanese Articles - title: "AWS Neuron 関連記事まとめ" url: "https://zenn.dev/tosshi/articles/36f3615e26c323" description: "AWS Neuron エコシステムに関する自身が作成した一連の技術記事のインデックス" author: "littlemex" date: "2026-02-20" category: "blog" locale: "ja-JP" keywords: ["trainium", "neuron", "collective-communication", "architecture", "japanese"] featured: false - title: "【AWS re:Invent 2025 速報】AWS 自社設計 AIチップ AWS Trainium3 の全貌" url: "https://zenn.dev/aws_japan/articles/06808526d5c75f" description: "AWS re:Invent 2025で発表されたAWS Trainium3カスタムAIチップの完全な概要をお届けします。" author: "AWS Japan" date: "2025-12-06" category: "news" locale: "ja-JP" keywords: ["trainium3", "reinvent", "announcement", "ai-chip"] featured: false - title: "【AWS Trainium 50本ノック #0】はじめに" url: "https://zenn.dev/karakuri_blog/articles/77d93c40b27b60" description: "AWS Trainium 50本ノックシリーズの紹介 - 入門ガイド。" author: "Karakuri" date: "2025-11-18" category: "tutorial" locale: "ja-JP" keywords: ["trainium", "tutorial", "getting-started", "series"] featured: false - title: "「Syn Pro」開発レポート:AWS TrainiumとRFTによる高性能日本語LLMの実現" url: "https://zenn.dev/karakuri_blog/articles/b923acfc86083b" description: "AWS TrainiumとRFTを使用した高性能日本語LLMの構築に関する開発レポート。" author: "Karakuri" date: "2025-10-24" category: "case-study" locale: "ja-JP" keywords: ["trainium", "llm", "japanese", "rft", "case-study"] featured: false - title: "AWS Inferentia2 + Llama 3.2 にできること" url: "https://zenn.dev/exwzd/articles/20250930-inferentia-llama" description: "AWS Inferentia2とLlama 3.2モデルでできることを紹介します。" author: "exwzd" date: "2025-09-30" category: "blog" locale: "ja-JP" keywords: ["inferentia2", "llama", "capabilities", "inference"] featured: false - title: "AWS Inferentia2とvLLMでLlama 3.2の推論サーバーを構築する手順" url: "https://zenn.dev/exwzd/articles/20250827_inferentia_compile" description: "AWS Inferentia2とvLLMを使用してLlama 3.2推論サーバーを構築するステップバイステップガイド。" author: "exwzd" date: "2025-08-28" category: "tutorial" locale: "ja-JP" keywords: ["inferentia2", "vllm", "llama", "inference", "tutorial"] featured: false - title: "【開催報告】Neuron Community – Vol.2" url: "https://aws.amazon.com/jp/blogs/news/neuron-community-vol-2/" description: "Neuron Community Vol.2の開催報告。" author: "AWS Japan" date: "2025-07-24" category: "news" locale: "ja-JP" keywords: ["community", "event", "neuron", "japan"] featured: false - title: "KARAKURI VL - 日本語コンピュータユースに特化した視覚言語モデル" url: "https://zenn.dev/karakuri_blog/articles/28c73f2ada797a" description: "日本語コンピュータユースに特化したビジョン言語モデルKARAKURI VLの紹介。" author: "Karakuri" date: "2025-07-11" category: "blog" locale: "ja-JP" keywords: ["vision-language", "japanese", "multimodal", "karakuri"] featured: false - title: "LLM-jp Chatbot Arenaを試験運用しました" url: "https://llm-jp.nii.ac.jp/ja/blog/blog-836/" description: "LLM-jp Chatbot Arenaの試験運用に関するレポート。" author: "LLM-jp" date: "2025-05-12" category: "blog" locale: "ja-JP" keywords: ["llm", "chatbot", "arena", "japanese"] featured: false - title: "【開催報告】Neuron Community – Day One" url: "https://aws.amazon.com/jp/blogs/news/neuron-community-day-one/" description: "初回Neuron Community Dayの開催報告。" author: "AWS Japan" date: "2025-04-14" category: "news" locale: "ja-JP" keywords: ["community", "event", "neuron", "japan"] featured: false - title: "EKS Auto Mode でサクッと機械学習用インスタンスを利用してみる。 AWS 独自設計チップ搭載の Trainium と Inferentia 
を使ってみた!" url: "https://dev.classmethod.jp/articles/eks-auto-mode-gpu-aws-trainium-inferentia/" description: "EKS Auto Modeを使用してMLインスタンスを簡単に利用する方法。AWS TrainiumとInferentiaチップの活用ガイド。" author: "Classmethod" date: "2025-01-02" category: "tutorial" locale: "ja-JP" keywords: ["eks", "trainium", "inferentia", "kubernetes", "tutorial"] featured: false # Korean Articles - title: "Nota AI가 제안하는 AWS Inferentia에서 다양한 LLM 모델 양자화 최적화기법 사용하기" url: "https://aws.amazon.com/ko/blogs/tech/llm-model-quantization-techniques-for-aws-inferentia-by-nota-ai/" description: "Nota AI가 제안하는 AWS Inferentia에서 LLM 모델 양자화 최적화 기법." author: "Nota AI / AWS Korea" date: "2026-01-20" category: "blog" locale: "ko-KR" keywords: ["inferentia", "quantization", "llm", "optimization", "nota-ai"] featured: false - title: "Nota AI가 제안하는 Transformer 모델을 AWS Inferentia/Trainium에 손쉽게 배포하는 방법" url: "https://aws.amazon.com/ko/blogs/tech/tips-for-using-transformer-models-on-aws-inf-and-trn/" description: "Nota AI가 제안하는 AWS Inferentia/Trainium에서 Transformer 모델을 쉽게 배포하는 방법." author: "Nota AI / AWS Korea" date: "2025-04-09" category: "blog" locale: "ko-KR" keywords: ["transformer", "deployment", "inferentia", "trainium", "nota-ai"] featured: false - title: "콜드스타트 추천 문제를 AWS Trainium과 vLLM으로 해결하는 자동화 전략" url: "https://blog.a-cloud.co.kr/2025/07/25/%EC%BD%9C%EB%93%9C%EC%8A%A4%ED%83%80%ED%8A%B8-%EC%B6%94%EC%B2%9C-%EB%AC%B8%EC%A0%9C%EB%A5%BC-aws-trainium%EA%B3%BC-vllm%EC%9C%BC%EB%A1%9C-%ED%95%B4%EA%B2%B0%ED%95%98%EB%8A%94-%EC%9E%90%EB%8F%99/" description: "AWS Trainium과 vLLM을 사용하여 콜드 스타트 추천 문제를 해결하는 자동화 전략." author: "A-Cloud" date: "2025-07-25" category: "blog" locale: "ko-KR" keywords: ["trainium", "vllm", "cold-start", "recommendations", "automation"] featured: false - title: "DeepSeek-R1 모델 AWS 출시" url: "https://aws.amazon.com/ko/blogs/korea/deepseek-r1-models-now-available-on-aws/" description: "AWS에서 DeepSeek-R1 모델을 사용할 수 있게 되었습니다." author: "AWS Korea" date: "2025-02-05" category: "news" locale: "ko-KR" keywords: ["deepseek", "r1", "model", "launch", "aws"] featured: false # Chinese Articles - title: "使用亚马逊云科技自研芯片 Inferentia2 部署 DeepSeek R1 Distillation 模型(一)" url: "https://aws.amazon.com/cn/blogs/china/deploying-the-deepseek-r1-distillation-model-using-amazon-inferentia2/" description: "使用亚马逊云科技自研芯片 Inferentia2 部署 DeepSeek R1 Distillation 模型(第一部分)。" author: "AWS China" date: "2025-02-12" category: "tutorial" locale: "zh-CN" keywords: ["inferentia2", "deepseek", "r1", "deployment", "distillation"] featured: false - title: "使用亚马逊云科技自研芯片 Inferentia2 部署 DeepSeek R1 Distillation 模型(二)" url: "https://aws.amazon.com/cn/blogs/china/deploying-the-deepseek-r1-distillation-model-using-amazon-inferentia2-part-two/" description: "使用亚马逊云科技自研芯片 Inferentia2 部署 DeepSeek R1 Distillation 模型(第二部分)。" author: "AWS China" date: "2025-02-14" category: "tutorial" locale: "zh-CN" keywords: ["inferentia2", "deepseek", "r1", "deployment", "distillation"] featured: false - title: "Bytedance processes billions of daily videos using their multimodal video understanding models on AWS Inferentia2" url: "https://aws.amazon.com/blogs/machine-learning/bytedance-processes-billions-of-daily-videos-using-their-multimodal-video-understanding-models-on-aws-inferentia2/" description: "How Bytedance processes billions of daily videos using multimodal models on AWS Inferentia2." 
author: "AWS" date: "2025-02-26" category: "case-study" locale: "en-US" keywords: ["inferentia2", "bytedance", "video", "multimodal", "case-study"] featured: false - title: "基于 HAMi 实现亚马逊云科技 Trainium 与 Inferentia 核心级共享与策略性拓扑调度" url: "https://aws.amazon.com/cn/blogs/china/achieve-trainium-and-inferentia-core-level-sharing-and-strategic-topology-scheduling/" description: "基于 HAMi 实现亚马逊云科技 Trainium 与 Inferentia 核心级共享与策略性拓扑调度。" author: "AWS China" date: "2025-11-06" category: "blog" locale: "zh-CN" keywords: ["trainium", "inferentia", "hami", "scheduling", "topology"] featured: false # Red Hat / AWS Neuron Collaboration - title: "Red Hat to Deliver Enhanced AI Inference Across AWS" url: "https://www.redhat.com/en/about/press-releases/red-hat-deliver-enhanced-ai-inference-across-aws" description: "Red Hat and AWS expand collaboration to power enterprise-grade generative AI using Red Hat AI Inference Server on AWS Inferentia2 and Trainium3." author: "Red Hat" date: "2025-12-02" category: "news" locale: "en-US" keywords: ["red-hat", "inferentia2", "trainium3", "vllm", "openshift", "inference", "collaboration"] featured: false - title: "Run cost-effective AI workloads on OpenShift with AWS Neuron Operator" url: "https://developers.redhat.com/articles/2025/12/02/cost-effective-ai-workloads-openshift-aws-neuron-operator" description: "How to use the AWS Neuron Operator to run LLM inference with vLLM on AWS AI chips in Red Hat OpenShift." author: "Red Hat" date: "2025-12-02" category: "tutorial" locale: "en-US" keywords: ["red-hat", "openshift", "neuron-operator", "vllm", "inferentia", "trainium", "kubernetes"] featured: false - title: "AWS Neuron Operator for AI Chips on AWS — GitHub Releases" url: "https://github.com/awslabs/operator-for-ai-chips-on-aws/releases" description: "Open-source AWS Neuron Operator for Kubernetes and Red Hat OpenShift, enabling native support for AWS Inferentia and Trainium accelerators." author: "AWS" date: "2025-12-02" category: "news" locale: "en-US" keywords: ["neuron-operator", "kubernetes", "openshift", "open-source", "inferentia", "trainium"] featured: false - title: "Red Hat AI Inference Server — vLLM Neuron Container Image (RHEL 9)" url: "https://catalog.redhat.com/en/software/containers/rhaiis/vllm-neuron-rhel9/698c42b20b626d81c97abd7f" description: "Certified container image for the Red Hat AI Inference Server with vLLM optimized for AWS Inferentia and Trainium accelerators via the AWS Neuron SDK. Provides enterprise-grade, high-performance LLM inference serving on RHEL 9, enabling production deployment of generative AI models on AWS AI chips through Red Hat OpenShift or Podman." author: "Red Hat" date: "2025-12-02" category: "news" locale: "en-US" keywords: ["red-hat", "vllm", "neuron", "inferentia", "trainium", "container", "rhel9", "inference", "openshift"] featured: true ================================================ FILE: about-neuron/news-and-blogs/validate_articles.py ================================================ #!/usr/bin/env python3 """ Validation script for news-and-blogs.yaml This script validates the structure and content of article entries to ensure they meet the required format before submission. Usage: python validate_articles.py """ import sys from pathlib import Path from datetime import datetime import re try: import yaml except ImportError: print("Error: PyYAML is required. 
Install with: pip install pyyaml") sys.exit(1) VALID_CATEGORIES = {'blog', 'news', 'tutorial', 'case-study', 'benchmark'} REQUIRED_FIELDS = {'title', 'url', 'description', 'author', 'date', 'category', 'locale', 'keywords'} OPTIONAL_FIELDS = {'featured', 'author_url', 'icon'} ALL_FIELDS = REQUIRED_FIELDS | OPTIONAL_FIELDS # Valid locale codes VALID_LOCALES = { 'en-US', 'en-GB', 'en-CA', 'en-AU', 'en-NZ', 'en-IE', 'en-IN', 'en-SG', 'en-ZA', 'ja-JP', 'zh-CN', 'zh-TW', 'zh-HK', 'ko-KR', 'th-TH', 'vi-VN', 'id-ID', 'ms-MY', 'fil-PH', 'de-DE', 'fr-FR', 'es-ES', 'es-MX', 'es-AR', 'pt-BR', 'pt-PT', 'it-IT', 'nl-NL', 'pl-PL', 'ru-RU', 'tr-TR', 'sv-SE', 'da-DK', 'no-NO', 'fi-FI', 'cs-CZ', 'hu-HU', 'ro-RO', 'el-GR', 'uk-UA', 'ar-SA', 'ar-AE', 'ar-EG', 'he-IL', 'fa-IR', 'hi-IN', 'bn-BD', 'ur-PK', 'sw-KE' } def validate_url(url): """Validate URL format""" url_pattern = re.compile( r'^https?://' # http:// or https:// r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' # domain r'localhost|' # localhost r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # or IP r'(?::\d+)?' # optional port r'(?:/?|[/?]\S+)$', re.IGNORECASE) return url_pattern.match(url) is not None def validate_date(date_str): """Validate date format (YYYY-MM-DD)""" try: datetime.strptime(date_str, '%Y-%m-%d') return True except ValueError: return False def validate_article(article, index, section): """Validate a single article entry""" errors = [] warnings = [] # Check for required fields missing_fields = REQUIRED_FIELDS - set(article.keys()) if missing_fields: errors.append(f"Missing required fields: {', '.join(missing_fields)}") # Check for unknown fields unknown_fields = set(article.keys()) - ALL_FIELDS if unknown_fields: warnings.append(f"Unknown fields (will be ignored): {', '.join(unknown_fields)}") # Validate title if 'title' in article: if not article['title'] or not isinstance(article['title'], str): errors.append("Title must be a non-empty string") elif len(article['title']) > 200: warnings.append(f"Title is very long ({len(article['title'])} chars). Consider shortening.") # Validate URL if 'url' in article: if not validate_url(article['url']): errors.append(f"Invalid URL format: {article['url']}") # Validate description if 'description' in article: if not article['description'] or not isinstance(article['description'], str): errors.append("Description must be a non-empty string") elif len(article['description']) < 20: warnings.append("Description is very short. Consider adding more detail.") elif len(article['description']) > 500: warnings.append(f"Description is very long ({len(article['description'])} chars). Consider shortening.") # Validate author if 'author' in article: if not article['author'] or not isinstance(article['author'], str): errors.append("Author must be a non-empty string") # Validate author_url (optional) if 'author_url' in article: if article['author_url'] and not validate_url(article['author_url']): errors.append(f"Invalid author_url format: {article['author_url']}") # Validate date if 'date' in article: if not validate_date(str(article['date'])): errors.append(f"Invalid date format: {article['date']}. Use YYYY-MM-DD") else: article_date = datetime.strptime(str(article['date']), '%Y-%m-%d') if article_date > datetime.now(): warnings.append(f"Date is in the future: {article['date']}") # Validate category if 'category' in article: if article['category'] not in VALID_CATEGORIES: errors.append(f"Invalid category: {article['category']}. 
Must be one of: {', '.join(VALID_CATEGORIES)}") # Validate locale if 'locale' in article: if not isinstance(article['locale'], str): errors.append("Locale must be a string") elif article['locale'] not in VALID_LOCALES: warnings.append(f"Locale '{article['locale']}' not in standard list. Will display with 🌐 globe icon. Common locales: en-US, ja-JP, zh-CN, de-DE, fr-FR, es-ES, pt-BR, ko-KR") # Validate keywords if 'keywords' in article: if not isinstance(article['keywords'], list): errors.append("Keywords must be a list") elif len(article['keywords']) == 0: warnings.append("Keywords list is empty. Consider adding relevant keywords for better filtering") else: for i, keyword in enumerate(article['keywords']): if not isinstance(keyword, str): errors.append(f"Keyword at index {i} must be a string") elif len(keyword.strip()) == 0: warnings.append(f"Keyword at index {i} is empty or whitespace") if len(article['keywords']) > 10: warnings.append(f"Article has {len(article['keywords'])} keywords. Consider limiting to 5-10 most relevant keywords") # Validate featured if 'featured' in article: if not isinstance(article['featured'], bool): errors.append("Featured must be true or false (boolean)") if section == 'all_articles' and article['featured']: warnings.append("Article marked as featured but in all_articles section") # Validate icon (optional) if 'icon' in article: if not isinstance(article['icon'], str) or len(article['icon']) > 10: warnings.append("Icon should be a short string (emoji recommended)") return errors, warnings def main(): """Main validation function""" yaml_file = Path(__file__).parent / 'news-and-blogs.yaml' if not yaml_file.exists(): print(f"❌ Error: {yaml_file} not found") return 1 print(f"Validating {yaml_file}...\n") try: with open(yaml_file, 'r', encoding='utf-8') as f: data = yaml.safe_load(f) except yaml.YAMLError as e: print(f"❌ YAML Parse Error: {e}") return 1 if not isinstance(data, dict): print("❌ Error: YAML file must contain a dictionary") return 1 total_errors = 0 total_warnings = 0 # Validate featured_articles section if 'featured_articles' in data: print("📌 Validating featured_articles section...") if not isinstance(data['featured_articles'], list): print("❌ Error: featured_articles must be a list") total_errors += 1 else: for i, article in enumerate(data['featured_articles'], 1): errors, warnings = validate_article(article, i, 'featured_articles') if errors or warnings: print(f"\n Article #{i}: {article.get('title', 'NO TITLE')}") for error in errors: print(f" ❌ Error: {error}") total_errors += 1 for warning in warnings: print(f" ⚠️ Warning: {warning}") total_warnings += 1 print() # Validate all_articles section if 'all_articles' in data: print("📚 Validating all_articles section...") if not isinstance(data['all_articles'], list): print("❌ Error: all_articles must be a list") total_errors += 1 else: for i, article in enumerate(data['all_articles'], 1): errors, warnings = validate_article(article, i, 'all_articles') if errors or warnings: print(f"\n Article #{i}: {article.get('title', 'NO TITLE')}") for error in errors: print(f" ❌ Error: {error}") total_errors += 1 for warning in warnings: print(f" ⚠️ Warning: {warning}") total_warnings += 1 print() # Summary print("=" * 60) if total_errors == 0 and total_warnings == 0: print("✅ Validation passed! 
No errors or warnings found.") return 0 else: print(f"Validation complete:") if total_errors > 0: print(f" ❌ {total_errors} error(s) found - must be fixed") if total_warnings > 0: print(f" ⚠️ {total_warnings} warning(s) found - should be reviewed") if total_errors > 0: print("\n❌ Validation FAILED - please fix errors before submitting") return 1 else: print("\n✅ Validation PASSED - warnings are optional to fix") return 0 if __name__ == '__main__': sys.exit(main()) ================================================ FILE: about-neuron/oss/index.rst ================================================ .. meta:: :description: GitHub repositories for AWS Neuron open source components, libraries, and tools. :date-modified: 12/02/2025 Neuron Open Source Repositories and Contribution =================================================== AWS Neuron provides open source code and samples for some of its components, libraries, and tools under the Apache 2.0 license. The current public repositories open to contribution at this time are listed below. Neuron Open Source GitHub Repositories --------------------------------------- .. grid:: 1 :gutter: 3 .. grid-item-card:: :class-body: sphinx-design-class-title-small **TorchNeuron PyTorch Extension Open Source** ^^^ Source code for the Neuron Native PyTorch extension and the TorchNeuron library that implements it for AWS Trainium. * Neuron GitHub source repository: https://github.com/aws-neuron/torch-neuronx .. grid-item-card:: :class-body: sphinx-design-class-title-small **Neuron Kernel Library Open Source** ^^^ Source code and specifications for the pre-built kernels that ship with the NKI Library . * Neuron GitHub source repository: https://github.com/aws-neuron/nki-library .. grid-item-card:: :class-body: sphinx-design-class-title-small **vLLM for Neuron Open Source** ^^^ Source code for the vLLM integrations with Neuron, supporting AWS Trainium and Inferentia. * Neuron GitHub source repository: https://github.com/vllm-project/vllm-neuron * **Note**: Released under vLLM project license (`LICENSE `__). .. grid-item-card:: :class-body: sphinx-design-class-title-small **NKI Samples** ^^^ Full code examples that support NKI kernel development. * Neuron GitHub source repository: https://github.com/aws-neuron/nki-samples How to Contribute to Neuron Open Source ---------------------------------------- Contributions via pull requests are appreciated! Before sending us a pull request, please ensure that: 1. You are working against the latest source on the `main`` branch. 2. You check existing open and recently merged pull requests and GitHub Issues to make sure someone else hasn't addressed the problem already. 3. You open a GitHub Issue for the repo to discuss any significant work. To send us a pull request: 1. Fork the repository. 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 3. Ensure local tests pass. 4. Commit to your fork using clear commit messages. 5. Send us a pull request, answering any default questions in the pull request interface. 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. GitHub provides documentation on `forking a repository `_ and `creating a pull request `_. 
For the specific details on licenses and contributing to each OSS repo, review the ``CONTRIBUTING.md`` pages linked below: * Contribute to TorchNeuron: https://github.com/aws-neuron/torch-neuronx/blob/main/CONTRIBUTING.md * Contribute to the NKI Library: https://github.com/aws-neuron/nki-library/blob/main/CONTRIBUTING.md * Contribute to the NKI samples: https://github.com/aws-neuron/nki-samples/blob/main/CONTRIBUTING.md .. Re-add this when available: * Contribute to vLLM Neuron: https://github.com/vllm-project/vllm-neuron/blob/main/CONTRIBUTING.md ================================================ FILE: about-neuron/profiling-tools.rst ================================================ .. _profiling-tools: Profiling Tools ================ .. toctree:: :maxdepth: 1 Neuron Profiler User Guide Neuron Profiler 2.0 (Beta) User Guide What's New ================================================ FILE: about-neuron/quick-start/_specs/REFACTORING_NOTES.md ================================================ # Quick-Start Refactoring Notes ## Summary The quick-start documentation has been restructured with a modern, task-based information architecture. The new structure eliminates the need for .txt includes in the primary quickstart paths. ## New Structure (No .txt includes) ### Primary Quickstarts (Self-contained) - `index.rst` - Main landing page with decision tree - `training-quickstart.rst` - Complete training workflow (no includes) - `inference-quickstart.rst` - Complete inference workflow (no includes) These files follow the procedural-quickstart template and contain all content inline. No external includes required. ### Supporting Pages - `docs-quicklinks.rst` - Quick navigation links - `github-samples.rst` - GitHub repository links ## Legacy Structure (Uses .txt includes) ### Legacy Quick-Start Pages (Inf1 only) - `torch-neuron.rst` - Uses tab-inference-torch-neuronx.txt and tab-inference-torch-neuron.txt - `tensorflow-neuron.rst` - Uses tab-inference-tensorflow-neuronx.txt and tab-inference-tensorflow-neuron.rst - `mxnet-neuron.rst` - Uses tab-inference-mxnet-neuron.txt These legacy pages: - Target Inf1 instances (NeuronCore v1) - Use .txt includes that reference `/src/helperscripts/installationScripts/python_instructions.txt` - Are de-emphasized in the new navigation (under "Legacy" section) - Are preserved for backward compatibility and existing links ### .txt Include Files (Legacy only) All .txt files in this directory are used exclusively by the legacy quick-start pages: - `tab-inference-torch-neuronx*.txt` (various OS versions) - `tab-inference-torch-neuron*.txt` (various OS versions) - `tab-inference-tensorflow-neuronx*.txt` (various OS versions) - `tab-inference-tensorflow-neuron*.txt` (various OS versions) - `tab-inference-mxnet-neuron*.txt` (various OS versions) - `select-framework-note.txt` ## Design Decision **Why not refactor legacy files?** 1. They target deprecated Inf1 hardware 2. They're not prominently featured in new navigation 3. Refactoring would require updating installation script references 4. Risk of breaking existing external links 5. New users are directed to the new self-contained quickstarts **Why are new quickstarts self-contained?** 1. Easier to maintain (all content in one place) 2. Better for AI/LLM context retrieval 3. Follows modern docs-as-code best practices 4. Clearer for human readers (no jumping between files) 5.
Follows the procedural-quickstart template structure ## Migration Path For users currently using legacy quick-starts: - Inf1 users: Continue using legacy pages (torch-neuron.rst, etc.) - New projects: Use new quickstarts (training-quickstart.rst, inference-quickstart.rst) - Inf2/Trn1/Trn2/Trn3 users: Use new quickstarts ## Future Cleanup When Inf1 support is fully deprecated: 1. Archive legacy quick-start pages to `/archive/quick-start/` 2. Remove .txt include files 3. Update any remaining cross-references 4. Update neuron_tag.py to remove special handling ================================================ FILE: about-neuron/quick-start/docs-quicklinks.rst ================================================ .. _docs-quick-links: Neuron Quick Links ================== .. grid:: 2 :gutter: 2 .. grid-item-card:: Overview * :ref:`neuron-quickstart` * :ref:`amazon-q-dev` * :ref:`model_samples_tutorials` * :ref:`benchmark` * :ref:`neuron_release_notes` * :ref:`announcements-main` .. grid-item-card:: ML frameworks * :ref:`pytorch-neuronx-main` * :ref:`jax-neuron-main` * :ref:`tensorflow-neuron-main` * :doc:`MXNet Neuron (archived) ` .. grid-item-card:: ML libraries * :ref:`nxdt` * :ref:`NxD Inference ` * :ref:`neuronx-distributed-index` * :ref:`transformers_neuronx_readme` * :ref:`nemo-megatron-index` .. grid-item-card:: User Guides * :ref:`neuron_runtime` * :ref:`neuron_cc` * :ref:`Neuron Kernel Interface (NKI) (beta) ` * :ref:`Neuron Custom C++ Operators (beta) ` * :ref:`monitoring_tools` * :ref:`profiling-tools` * :ref:`setup-guide-index` * :ref:`neuron-dlami-overview` * :ref:`neuron_containers` * :ref:`neuron-devflows` .. grid-item-card:: Learn AWS Neuron * :ref:`neuron-architecture-index` * :ref:`neuron-features-index` * :ref:`neuron-appnotes-index` * :ref:`neuron_faq` * :ref:`general-troubleshooting` .. grid-item-card:: About AWS Neuron * :ref:`neuron_release_notes` ================================================ FILE: about-neuron/quick-start/github-samples.rst ================================================ .. _neuron-github-samples: Neuron GitHub Samples ===================== .. grid:: 2 .. dropdown:: Training Samples for ``Trn1`` :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in :open: * `PyTorch Neuron (torch-neuronx) samples for Trn1 `_ * `Nemo Megatron for Neuron for Trn1 `_ * `AWS Neuron samples for ParallelCluster `_ * `AWS Neuron samples for EKS `_ * `AWS Neuron samples for SageMaker `_ * `AWS Neuron samples for Batch `_ .. dropdown:: Inference Samples for ``Inf2 & Trn1`` :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in :open: * `PyTorch Neuron (torch-neuronx) samples for Inf2 & Trn1 `_ * `Transformers Neuron (transformers-neuronx) samples `_ * `AWS Neuron samples for SageMaker `_ .. dropdown:: Inference Samples for ``Inf1`` :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in :open: * `PyTorch Neuron (torch-neuron) samples for Inf1 `_ * `TensorFlow Neuron (tensorflow-neuron) samples for Inf1 `_ ================================================ FILE: about-neuron/quick-start/index.rst ================================================ .. 
meta:: :description: Get started quickly with AWS Neuron SDK for PyTorch, JAX, and TensorFlow on Inferentia and Trainium :keywords: neuron, quickstart, getting started, pytorch, jax, tensorflow, inferentia, trainium, training, inference :instance-types: inf2, trn1, trn2, trn3 :content-type: navigation-hub :date-modified: 2026-03-03 .. _neuron-quickstart: Get Started with AWS Neuron ============================ Get up and running with AWS Neuron SDK in minutes. These quickstarts guide you through your first training or inference workload on Inferentia and Trainium instances. .. note:: **First time using AWS Neuron?** These quickstarts assume you have: - An active AWS account with EC2 access - Basic familiarity with your chosen ML framework (PyTorch, JAX, or TensorFlow) - SSH access to launch and connect to EC2 instances For detailed installation instructions, see the :doc:`Setup Guide `. Choose Your Path ---------------- Select the quickstart that matches your use case: .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: 🚀 Training Quickstart :link: training-quickstart :link-type: ref :class-card: sd-border-2 Train your first model on Trainium - Launch a Trn1 instance - Run a PyTorch training script - Monitor training progress **Time**: ~15 minutes :bdg-primary:`Trn1` :bdg-primary:`Trn2` :bdg-primary:`Trn3` .. grid-item-card:: 🎯 Inference Quickstart :link: inference-quickstart :link-type: ref :class-card: sd-border-2 Run your first inference on Inferentia - Launch an Inf2 instance - Load a pre-compiled model - Run predictions **Time**: ~10 minutes :bdg-success:`Inf2` :bdg-success:`Trn1` Specialized Quickstarts ----------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: 💬 LLM Serving with vLLM :class-card: sd-border-1 Deploy large language models for production inference - :doc:`Online serving ` (OpenAI-compatible API) - :doc:`Offline batch inference ` **Time**: ~20 minutes :bdg-info:`Inf2` :bdg-info:`Trn1` .. grid-item-card:: 🤖 Amazon AI helper tools :link: amazon-q-dev :link-type: ref :class-card: sd-border-1 Use AI-powered code assistance for Neuron development - Get code suggestions - Debug Neuron applications - Optimize performance **Time**: ~5 minutes Framework-Specific Guides ------------------------- Need framework-specific setup instructions? .. grid:: 1 1 3 3 :gutter: 2 .. grid-item-card:: PyTorch :link: /setup/pytorch/index :link-type: doc :class-card: sd-border-1 :class-body: sphinx-design-class-title-small PyTorch 2.9+ setup .. grid-item-card:: JAX :link: /setup/jax/index :link-type: doc :class-card: sd-border-1 :class-body: sphinx-design-class-title-small JAX 0.7+ setup .. grid-item-card:: TensorFlow :link: /archive/tensorflow/index :link-type: doc :class-card: sd-border-1 :class-body: sphinx-design-class-title-small TensorFlow 2.x setup Additional Resources -------------------- - :doc:`/about-neuron/models/index` - Pre-tested model samples and tutorials - :doc:`/devflows/ec2-flows` - Detailed EC2 deployment workflows - :doc:`/containers/index` - Use Deep Learning Containers - :doc:`docs-quicklinks` - Quick links to all Neuron documentation - :doc:`github-samples` - GitHub sample repositories Legacy Quick-Start Pages (Inf1) -------------------------------- .. warning:: The following pages are for legacy Inf1 instances only. For new projects, use the quickstarts above for Inf2, Trn1, Trn2, or Trn3. - :doc:`torch-neuron` - PyTorch on Inf1 - :doc:`tensorflow-neuron` - TensorFlow on Inf1 - :doc:`mxnet-neuron` - MXNet on Inf1 .. 
toctree:: :hidden: :maxdepth: 1 training-quickstart inference-quickstart /libraries/nxd-inference/vllm/quickstart-vllm-online-serving /libraries/nxd-inference/vllm/quickstart-vllm-offline-serving /about-neuron/amazonq-getstarted docs-quicklinks github-samples torch-neuron tensorflow-neuron mxnet-neuron ================================================ FILE: about-neuron/quick-start/inference-quickstart.rst ================================================ .. meta:: :description: Run your first inference workload on AWS Inferentia with PyTorch and Neuron SDK :keywords: neuron, inference, quickstart, pytorch, inferentia, inf2, getting started :instance-types: inf2, trn1 :content-type: quickstart :date-modified: 2026-03-03 .. _inference-quickstart: Quickstart: Run Inference on Inferentia ======================================== This quickstart guides you through running your first PyTorch inference workload on AWS Inferentia. You'll launch an Inf2 instance, compile a model for Neuron, and run predictions. When you complete this quickstart, you'll understand the basic workflow for deploying models on Inferentia. **This quickstart is for**: ML engineers and developers deploying inference workloads **Time to complete**: ~10 minutes Prerequisites ------------- Before you begin, ensure you have: - An AWS account with EC2 launch permissions - AWS CLI configured with your credentials - SSH key pair for EC2 access - Basic familiarity with PyTorch - Terminal access (Linux, macOS, or WSL on Windows) Step 1: Launch an Inferentia instance -------------------------------------- In this step, you will launch an Inf2 instance using the AWS Deep Learning AMI. Launch an Inf2.xlarge instance with the latest Deep Learning AMI: .. code-block:: bash aws ec2 run-instances \ --image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \ --instance-type inf2.xlarge \ --key-name YOUR_KEY_NAME \ --security-group-ids YOUR_SECURITY_GROUP \ --subnet-id YOUR_SUBNET_ID .. note:: Replace ``YOUR_KEY_NAME``, ``YOUR_SECURITY_GROUP``, and ``YOUR_SUBNET_ID`` with your values. Alternatively, launch the instance through the `EC2 Console `_. Connect to your instance via SSH: .. code-block:: bash ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP Verify Neuron devices are available: .. code-block:: bash neuron-ls You should see output showing available NeuronCores: .. code-block:: text +--------+--------+--------+---------+ | NEURON | NEURON | NEURON | PCI | | DEVICE | CORES | MEMORY | BDF | +--------+--------+--------+---------+ | 0 | 2 | 32 GB | 00:1e.0 | +--------+--------+--------+---------+ Step 2: Set up your environment -------------------------------- In this step, you will create a Python virtual environment and install PyTorch with Neuron support. Create and activate a virtual environment: .. code-block:: bash python3 -m venv neuron_env source neuron_env/bin/activate Install PyTorch Neuron and dependencies: .. code-block:: bash pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com Verify the installation: .. code-block:: bash python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')" You should see output confirming PyTorch is installed: .. code-block:: text PyTorch: 2.9.0+cpu Step 3: Compile a model for Neuron ----------------------------------- In this step, you will create a simple model and compile it for Neuron inference. Create a file named ``compile_model.py``: .. 
code-block:: python import torch import torch.nn as nn import torch_neuronx # Simple neural network class SimpleNet(nn.Module): def __init__(self): super().__init__() self.fc1 = nn.Linear(784, 128) self.fc2 = nn.Linear(128, 10) self.relu = nn.ReLU() def forward(self, x): x = self.relu(self.fc1(x)) return self.fc2(x) # Create model and set to eval mode model = SimpleNet() model.eval() # Create example input example_input = torch.randn(1, 784) # Trace and compile for Neuron print("Compiling model for Neuron...") neuron_model = torch_neuronx.trace(model, example_input) # Save compiled model neuron_model.save('simple_net_neuron.pt') print("Model compiled and saved to simple_net_neuron.pt") Run the compilation script: .. code-block:: bash python compile_model.py You should see compilation progress and success message: .. code-block:: text Compiling model for Neuron... INFO:Neuron:Compiling function _NeuronGraph$1 with neuronx-cc INFO:Neuron:Compilation successful Model compiled and saved to simple_net_neuron.pt .. note:: Model compilation happens once. The compiled model (``simple_net_neuron.pt``) can be reused for inference without recompiling. Step 4: Run inference ---------------------- In the final step, you will load the compiled model and run predictions. Create a file named ``run_inference.py``: .. code-block:: python import torch import torch_neuronx # Load compiled model print("Loading compiled model...") neuron_model = torch.jit.load('simple_net_neuron.pt') # Create sample input sample_input = torch.randn(1, 784) # Run inference print("Running inference...") with torch.no_grad(): output = neuron_model(sample_input) # Get prediction predicted_class = output.argmax(dim=1).item() print(f"Predicted class: {predicted_class}") print(f"Output logits: {output[0][:5].tolist()}") # Show first 5 logits # Run multiple inferences to measure throughput print("\nRunning 100 inferences...") import time start = time.time() with torch.no_grad(): for _ in range(100): output = neuron_model(sample_input) elapsed = time.time() - start throughput = 100 / elapsed print(f"Throughput: {throughput:.2f} inferences/second") print(f"Latency: {elapsed/100*1000:.2f} ms per inference") Run the inference script: .. code-block:: bash python run_inference.py You should see inference results: .. code-block:: text Loading compiled model... Running inference... Predicted class: 7 Output logits: [0.123, -0.456, 0.789, -0.234, 0.567] Running 100 inferences... Throughput: 245.67 inferences/second Latency: 4.07 ms per inference Monitor Neuron device utilization in another terminal: .. code-block:: bash neuron-top This shows real-time NeuronCore utilization and inference metrics. Confirmation ------------ Congratulations! You've successfully run inference on AWS Inferentia. You should have: - ✅ Launched an Inf2 instance with Neuron SDK - ✅ Installed PyTorch with Neuron support - ✅ Compiled a model for Neuron inference - ✅ Ran predictions and measured throughput - ✅ Monitored inference with Neuron tools If you encountered any issues, see the **Common issues** section below. Common issues ------------- **Issue**: ``ModuleNotFoundError: No module named 'torch_neuronx'`` **Solution**: Ensure you activated the virtual environment and installed packages: .. code-block:: bash source neuron_env/bin/activate pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com **Issue**: ``RuntimeError: No Neuron devices found`` **Solution**: Verify you're on an Inferentia instance and devices are visible: .. 
code-block:: bash neuron-ls If no devices appear, check the instance type and driver installation. **Issue**: Compilation takes a long time **Solution**: Model compilation is a one-time cost. For this simple model, compilation should take 1-2 minutes. Larger models take longer but only need to be compiled once. The compiled model can be saved and reused. **Issue**: Lower throughput than expected **Solution**: This quickstart uses a small model and batch size for demonstration. For production workloads: - Use larger batch sizes (e.g., 4, 8, 16) - Enable dynamic batching - Use multiple NeuronCores in parallel - See :doc:`/frameworks/torch/torch-neuronx/programming-guide/inference/index` for optimization techniques Clean up -------- To avoid ongoing charges, terminate your instance when finished: .. code-block:: bash # From your local machine aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID Or use the EC2 Console to terminate the instance. Next steps ---------- Now that you've completed this quickstart, explore more advanced inference topics: - :doc:`/frameworks/torch/torch-neuronx/programming-guide/inference/index` - Comprehensive inference guide - :doc:`/libraries/nxd-inference/index` - Production inference with NeuronX Distributed - :doc:`/libraries/nxd-inference/vllm/quickstart-vllm-online-serving` - Deploy LLMs with vLLM - :doc:`/about-neuron/models/index` - Pre-tested model samples - :doc:`/tools/neuron-explorer/index` - Profile and optimize inference performance Further reading --------------- - :doc:`/setup/pytorch/index` - Detailed PyTorch installation options - :doc:`/devflows/ec2-flows` - EC2 deployment workflows - :doc:`/frameworks/torch/index` - Complete PyTorch Neuron documentation - :doc:`/compiler/index` - Understanding Neuron compilation ================================================ FILE: about-neuron/quick-start/mxnet-neuron.rst ================================================ .. _mxnet_quick_start: Get Started with Apache MXNet Neuron ===================================== This page provides links to help you get started quickly with :doc:`MXNet Neuron ` (supporting inference only). .. note:: The instructions below are for Ubuntu 20. If you are looking for complete setup instructions for different platforms, please :ref:`Check Here. ` .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /setup/install-templates/launch-instance.txt .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 5 :end-line: 6 .. include:: /includes/setup/tab-inference-mxnet-neuron.txt ================================================ FILE: about-neuron/quick-start/tab-inference-tensorflow-neuron.rst ================================================ .. dropdown:: Install TensorFlow Neuron (``tensorflow-neuron``) :class-title: drop-down-class-title-small :class-body: drop-down-class-body-small :animate: fade-in .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=compiler_framework ..
dropdown:: Get Started with Inference (``Inf1``) :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in :ref:`ResNet-50 ` .. card:: Visit TensorFlow Neuron section for more :class-body: sphinx-design-class-body-small :link: tensorflow-neuron-main :link-type: ref ================================================ FILE: about-neuron/quick-start/tensorflow-neuron.rst ================================================ .. _tensorflow_quick_start: Get Started with TensorFlow Neuron ================================== This page provides links to help you get started quickly with :ref:`tensorflow-neuron-main`. .. note:: The instructions below are for Ubuntu 20. If you are looking for complete setup instructions for different platforms, please :ref:`Check Here. ` .. _tensorflow_quick_start_inference: .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /setup/install-templates/launch-instance.txt .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 5 :end-line: 6 .. tab-set:: .. tab-item:: tensorflow-neuronx (``Trn1, Inf2``) .. include:: /includes/setup/tab-inference-tensorflow-neuronx.txt .. tab-item:: tensorflow-neuron (``Inf1``) .. include:: /includes/setup/tab-inference-tensorflow-neuron.rst ================================================ FILE: about-neuron/quick-start/torch-neuron-tab-training.rst ================================================ .. dropdown:: Launch Trn1 Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /setup/install-templates/launch-instance.txt .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. code:: bash # Configure Linux for Neuron repository updates sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <` .. dropdown:: Launch the Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /setup/install-templates/launch-instance.txt .. dropdown:: Install Drivers and Tools :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 5 :end-line: 6 .. tab-set:: .. tab-item:: torch-neuronx (``Trn1, Inf2``) .. include:: /includes/setup/tab-inference-torch-neuronx.txt .. tab-item:: torch-neuron (``Inf1``) .. include:: /includes/setup/tab-inference-torch-neuron.txt ================================================ FILE: about-neuron/quick-start/training-quickstart.rst ================================================ .. meta:: :description: Train your first model on AWS Trainium with PyTorch and Neuron SDK :keywords: neuron, training, quickstart, pytorch, trainium, trn1, getting started :instance-types: trn1, trn2, trn3 :content-type: quickstart :date-modified: 2026-03-03 .. _training-quickstart: Quickstart: Train a Model on Trainium ====================================== This quickstart guides you through training your first PyTorch model on AWS Trainium. You'll launch a Trn1 instance, install the Neuron SDK, and run a simple training script.
When you complete this quickstart, you'll understand the basic workflow for training models with Neuron. **This quickstart is for**: ML engineers and data scientists new to AWS Trainium **Time to complete**: ~15 minutes Prerequisites ------------- Before you begin, ensure you have: - An AWS account with EC2 launch permissions - AWS CLI configured with your credentials - SSH key pair for EC2 access - Basic familiarity with PyTorch - Terminal access (Linux, macOS, or WSL on Windows) Step 1: Launch a Trainium instance ----------------------------------- In this step, you will launch a Trn1 instance using the AWS Deep Learning AMI. First, launch a Trn1.2xlarge instance with the latest Deep Learning AMI: .. code-block:: bash aws ec2 run-instances \ --image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \ --instance-type trn1.2xlarge \ --key-name YOUR_KEY_NAME \ --security-group-ids YOUR_SECURITY_GROUP \ --subnet-id YOUR_SUBNET_ID .. note:: Replace ``YOUR_KEY_NAME``, ``YOUR_SECURITY_GROUP``, and ``YOUR_SUBNET_ID`` with your values. Alternatively, launch the instance through the `EC2 Console `_. Once the instance is running, connect via SSH: .. code-block:: bash ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP Verify Neuron devices are available: .. code-block:: bash neuron-ls You should see output showing available NeuronCores: .. code-block:: text +--------+--------+--------+---------+ | NEURON | NEURON | NEURON | PCI | | DEVICE | CORES | MEMORY | BDF | +--------+--------+--------+---------+ | 0 | 2 | 32 GB | 00:1e.0 | | 1 | 2 | 32 GB | 00:1f.0 | +--------+--------+--------+---------+ Step 2: Set up your environment -------------------------------- In this step, you will create a Python virtual environment and install PyTorch with Neuron support. Create and activate a virtual environment: .. code-block:: bash python3 -m venv neuron_env source neuron_env/bin/activate Install PyTorch Neuron and dependencies: .. code-block:: bash pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com Verify the installation: .. code-block:: bash python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')" You should see output confirming PyTorch is installed: .. code-block:: text PyTorch: 2.9.0+cpu Step 3: Create a training script --------------------------------- In this step, you will create a simple PyTorch training script that uses Neuron acceleration. Create a file named ``train_simple.py``: .. 
code-block:: python import torch import torch.nn as nn import torch.optim as optim import torch_neuronx # Simple neural network class SimpleNet(nn.Module): def __init__(self): super().__init__() self.fc1 = nn.Linear(784, 128) self.fc2 = nn.Linear(128, 10) self.relu = nn.ReLU() def forward(self, x): x = self.relu(self.fc1(x)) return self.fc2(x) # Create model and move to Neuron device model = SimpleNet().to('neuron') criterion = nn.CrossEntropyLoss() optimizer = optim.SGD(model.parameters(), lr=0.01) # Generate dummy training data batch_size = 32 num_batches = 100 print("Starting training...") model.train() for batch_idx in range(num_batches): # Create dummy batch inputs = torch.randn(batch_size, 784).to('neuron') targets = torch.randint(0, 10, (batch_size,)).to('neuron') # Training step optimizer.zero_grad() outputs = model(inputs) loss = criterion(outputs, targets) loss.backward() optimizer.step() if batch_idx % 10 == 0: print(f"Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}") print("Training complete!") This script creates a simple neural network, moves it to the Neuron device, and trains it on synthetic data. Step 4: Run training --------------------- In the final step, you will run the training script and monitor its progress. Execute the training script: .. code-block:: bash python train_simple.py You should see training progress output: .. code-block:: text Starting training... Batch 0/100, Loss: 2.3156 Batch 10/100, Loss: 2.2845 Batch 20/100, Loss: 2.2534 ... Training complete! Monitor Neuron device utilization in another terminal: .. code-block:: bash neuron-top This shows real-time NeuronCore utilization, memory usage, and other metrics. Confirmation ------------ Congratulations! You've successfully trained your first model on AWS Trainium. You should have: - ✅ Launched a Trn1 instance with Neuron SDK - ✅ Installed PyTorch with Neuron support - ✅ Created and ran a training script on Neuron devices - ✅ Monitored training with Neuron tools If you encountered any issues, see the **Common issues** section below. Common issues ------------- **Issue**: ``ModuleNotFoundError: No module named 'torch_neuronx'`` **Solution**: Ensure you activated the virtual environment and installed packages: .. code-block:: bash source neuron_env/bin/activate pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com **Issue**: ``RuntimeError: No Neuron devices found`` **Solution**: Verify you're on a Trainium instance and devices are visible: .. code-block:: bash neuron-ls If no devices appear, check instance type and driver installation. **Issue**: Training is slower than expected **Solution**: This quickstart uses a small model for demonstration. For production workloads: - Use larger batch sizes - Enable XLA compilation with ``torch.compile()`` - See :doc:`/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide` for optimization techniques Clean up -------- To avoid ongoing charges, terminate your instance when finished: .. code-block:: bash # From your local machine aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID Or use the EC2 Console to terminate the instance. 
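One follow-up on the batch-size advice under Common issues above: only the leading dimension of the dummy tensors changes, while the model, loss, and optimizer stay the same. The sketch below reuses the ``SimpleNet`` definition and the ``'neuron'`` device string from ``train_simple.py``; the batch size of 256 is an illustrative assumption, and a workable value depends on your model size and available device memory.

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.optim as optim
   import torch_neuronx  # registers Neuron device support, as in train_simple.py

   # Same model as train_simple.py above
   class SimpleNet(nn.Module):
       def __init__(self):
           super().__init__()
           self.fc1 = nn.Linear(784, 128)
           self.fc2 = nn.Linear(128, 10)
           self.relu = nn.ReLU()

       def forward(self, x):
           return self.fc2(self.relu(self.fc1(x)))

   model = SimpleNet().to('neuron')
   criterion = nn.CrossEntropyLoss()
   optimizer = optim.SGD(model.parameters(), lr=0.01)

   batch_size = 256  # illustrative; the quickstart script uses 32

   for batch_idx in range(100):
       # Larger dummy batch: only the leading dimension differs from train_simple.py
       inputs = torch.randn(batch_size, 784).to('neuron')
       targets = torch.randint(0, 10, (batch_size,)).to('neuron')

       optimizer.zero_grad()
       loss = criterion(model(inputs), targets)
       loss.backward()
       optimizer.step()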
Next steps ---------- Now that you've completed this quickstart, explore more advanced training topics: - :doc:`/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide` - Comprehensive training guide - :doc:`/libraries/nxd-training/index` - Distributed training with NeuronX Distributed - :doc:`/about-neuron/models/index` - Pre-tested model samples - :doc:`/tools/neuron-explorer/index` - Profile and optimize training performance Further reading --------------- - :doc:`/setup/pytorch/index` - Detailed PyTorch installation options - :doc:`/devflows/ec2-flows` - EC2 deployment workflows - :doc:`/frameworks/torch/index` - Complete PyTorch Neuron documentation ================================================ FILE: about-neuron/quick-start/user-guide-quickstart.rst ================================================ .. _userguide-quickstart: User Guide Quick Start ====================== * :ref:`setup-guide-index` * :ref:`Neuron Containers ` * :ref:`neuron-devflows` ================================================ FILE: about-neuron/sdk-policy.rst ================================================ .. _sdk-maintenance-policy: .. _neuron-maintenance-policy: Neuron Software Maintenance Policy ================================== .. contents:: Table of Contents :local: :depth: 3 Overview -------- This document outlines the software maintenance policy for the AWS Neuron Software Development Kit (SDK), Neuron components (both extension and standalone), supported model classes, features, APIs, DLAMIs and DLCs, and dependency software. AWS Neuron is the SDK for Amazon EC2 `Inferentia `__ and Amazon EC2 `Trainium `__ based instances purpose-built for deep learning. Neuron integrates with popular Machine Learning (ML) frameworks like PyTorch, JAX, and TensorFlow and includes a compiler, runtime, driver, profiling tools, and libraries to support high-performance training of generative AI models on Trainium and Inferentia powered instances. This document addresses the Neuron Software life-cycle and Neuron SDK release versioning. .. _neuron-software-definitions: Neuron Software Definitions --------------------------- Neuron Software refers to the complete set of software elements provided by AWS Neuron, including: Neuron SDK ~~~~~~~~~~ The core software development kit that enables users to build, train, and deploy machine learning models on Inferentia and Trainium based instances. The Neuron SDK encompasses the entire set of components, features, APIs, and other elements that are bundled together and made available in a particular version of the Neuron SDK release. Neuron components ~~~~~~~~~~~~~~~~~ Neuron components refer to any packages or libraries within the Neuron SDK that offer specific functionality. These components are typically accessible through PIP, RPM, or Debian packages for easy installation and usage. There are two main categories of Neuron components: Neuron extension components and Neuron standalone components. Neuron extension components ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Neuron extension components are components that integrate Neuron support into open source machine learning frameworks, libraries, or tools, enhancing their functionality and extending their capabilities as necessary. When referring to Neuron extension components, we are also referring to the parts of the open source machine learning framework or library that are supported by Neuron.
The software life-cycle of the open source machine learning frameworks, libraries or tools that are extended by Neuron is managed and maintained by their respective communities or the vendors responsible for those specific components. Examples of Neuron extension components are: - **Third party ML Library**: Examples include Neuron Nemo Megatron. - **Third party ML Framework**: Examples include PyTorch NeuronX and TensorFlow Neuron. Neuron standalone components ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Neuron standalone components are self-contained components within the Neuron SDK. Examples of such components are Neuron Compiler, Neuron Tools, and Neuron Runtime. Neuron Model Classes ~~~~~~~~~~~~~~~~~~~~ A Neuron supported model class is tightly coupled with a specific Neuron extension component (e.g., PyTorch NeuronX) or Neuron library (e.g., NeuronX Distributed) and the workload type (e.g., Training or Inference). For example, a model can be supported at Beta level in PyTorch NeuronX for training and Stable level in PyTorch NeuronX for inference. Neuron features ~~~~~~~~~~~~~~~ A Neuron feature refers to any functionality or attribute that is part of the Neuron SDK, whether it belongs to the entire Neuron SDK or to one of its specific components. Neuron APIs ~~~~~~~~~~~ A Neuron API refers to any API, CLI, environment variable, or flag that belongs to the entire Neuron SDK or to one of the Neuron components. A Neuron API allows developers to interact with and leverage the capabilities of the Neuron SDK and its components. Examples include :ref:`Neuron Trace API ` and :ref:`Neuron Compiler flags `. Dependency software components ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ External software components or frameworks that the Neuron SDK and its components rely on for proper functioning and compatibility, such as language runtimes or operating systems. The software life-cycle of the dependency software components is managed and maintained by their respective communities or the vendors responsible for those specific dependency software components. The following terms are examples of underlying dependency software components: - **Operating System (OS)**: Examples include Ubuntu 22 and Amazon Linux 2023 - **Language Runtime**: Examples include Python 3.10 Neuron Deep Learning AMIs and Deep Learning Containers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :ref:`Neuron Deep Learning AMIs (DLAMIs) ` and :ref:`Neuron Deep Learning Containers (DLCs) ` are pre-configured Amazon Machine Images and Docker containers that come with the Neuron SDK and necessary dependencies pre-installed, providing a ready-to-use environment for machine learning development. .. _neuron-software-lifecycle: Neuron Software Life-cycle -------------------------- The typical life-cycle for Neuron software consists of several phases, though not all phases are applicable to every type of Neuron software.
The phases are as follows: - **Developer Preview or Beta** (these terms are used interchangeably in Neuron collaterals) - **Release Candidate (RC)** - **General Availability (GA) or Stable** (these terms are used interchangeably in Neuron collaterals) - **Maintenance** - **End-of-Support (EOS)** The following table outlines the details for each phase for Neuron software: +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | | Description | Comments | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | Developer Preview (Beta) | In this phase, Neuron Software is not supported, should not be used in production environments, | | | | and is meant for early access and feedback purposes only. It is possible for future releases | | | | to introduce breaking changes. | | | | See :ref:`Neuron Software Classification ` for more information | | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | Release Candidate (RC) | Once AWS identifies a release to be a stable product, it may be marked as a Release Candidate (RC). | This phase applies only to Neuron SDK | | | This phase is usually short and during it AWS will provide updates for Neuron Software on an as-needed basis. | and Neuron components | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | General Availability (Stable) | During this phase, AWS releases :ref:`regular ` updates for the Neuron Software based | | | | on a predefined release cadence of the Neuron SDK or provides :ref:`maintenance updates `| | | | for Neuron Software on an as-needed basis. | | | | See :ref:`Neuron Software Classification ` for more information | | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | Maintenance | During the maintenance phase, AWS will provide :ref:`maintenance updates ` | This phase does not apply to Dependency Software | | | for Neuron Software on an as-needed basis. Any new PIP, RPM, and Debian packages for the Neuron | Components, Neuron DLCs, | | | Software, as well as updated versions of the Neuron DLAMIs and Neuron DLCs, will be released | Neuron DLAMIs, Neuron Features and APIs | | | only when deemed necessary by the AWS Neuron team. | | | | Users can expect updates to be less frequent compared to :ref:`regular ` | | | | as the focus will be on addressing critical issues and ensuring the stability of the software. | | | | | | | | Maintenance Announcement: AWS will make a public :ref:`announcement ` at least one month | | | | before the Neuron Software enters the Maintenance phase.
| | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ | End of Support (EOS) | When Neuron Software reaches the end of its support lifecycle, it will no longer receive | | | | :ref:`regular ` updates and :ref:`maintenance updates ` | | | | (including security updates). While AWS will continue to provide access to all previously released | | | | PIP, RPM, and Debian packages for the Neuron Software, as well as earlier versions of the Neuron DLAMIs | | | | and Neuron DLCs, it's important to note that these older versions will not receive any updates or support. | | | | Customers can still use these resources at their own discretion, but it is highly recommended to upgrade | | | | to the latest available versions. | | | | | | | | End of Support Announcement: AWS will make a public :ref:`announcement ` at least one month | | | | before Neuron Software enters End of Support. | | +-------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+ .. _neuron-regular-updates: Neuron Software Regular Updates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Regular updates for Neuron Software address the following areas: new features, feature improvements, performance enhancements, bug resolution, security vulnerability fixes, upgrades to Neuron dependency software components, and upgrades to Neuron extension components. To handle these regular updates, AWS will release a new version of the Neuron SDK, incrementing the minor version (the second digit in the version number) for a minor release or incrementing the major version (the first digit in the version number) for a major release when significant changes that break compatibility are introduced. It's important to note that any bug fixes or security issues in regular updates are not applied retroactively to previous versions of the Neuron SDK. To benefit from these updates, users must adopt the latest release. For more information, see: - :ref:`Neuron DLAMIs and DLCs Updates ` - :ref:`Neuron Extension Components Updates ` - :ref:`Neuron Software Versioning ` **Neuron SDK Installation and Update Instructions** To install and update to the latest Neuron packages, customers need to pin the major version of the Neuron package. For example, to install the latest Neuron tools package, call ``sudo apt-get install aws-neuronx-tools=2.*`` and to install the latest PyTorch Neuron package for Trn1, call ``pip install torch-neuronx==2.1.0.1.*``. This is done to future-proof instructions for new, backwards-incompatible major version releases. .. _neuron-maintenance-updates: Neuron Software Maintenance Updates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Maintenance updates for Neuron Software address three key areas: resolving bugs, fixing security vulnerabilities, and upgrading dependency software components. At AWS's discretion, additional critical features or performance enhancements may also be included. To handle these maintenance updates, AWS will release a new version of the Neuron SDK, incrementing the patch number (the last digit in the version number) to indicate a patch release. Major or minor releases may also contain maintenance updates. It's important to note that these maintenance updates are not applied retroactively to previous versions of the Neuron SDK.
To take advantage of these updates, users must adopt the latest patch release. For more information, see: - :ref:`Neuron DLAMIs and DLCs Updates ` - :ref:`Neuron Extension Components Updates ` - :ref:`Neuron Software Versioning ` .. _neuron-dlami-dlc-updates: Neuron DLAMIs and DLCs Updates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ AWS will address :ref:`regular ` updates, life-cycle changes, maintenance updates, and security issues related to any third-party software included in the Neuron DLAMI or DLCs by releasing new versions of the Neuron DLAMI or DLCs. However, updates won't be applied retroactively to older versions of the Neuron DLAMI or DLCs. Instead, users will need to use the new versions to get the latest updates. Generally, Neuron DLAMIs and Deep Learning Containers (DLCs) will support the latest LTS Linux distribution version (Ubuntu, Amazon Linux, and Rocky 9), with exceptions. Neuron Base DLAMIs (which come pre-installed with the Neuron driver, EFA, and Neuron tools) will support the two latest versions of LTS Linux Distributions. For more information, see: - :ref:`Neuron Extension Components Updates ` - :ref:`Neuron Software Versioning ` .. _neuron-extension-components-updates: Neuron Extension Components Updates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a new version of an open source ML framework (e.g., PyTorch) is supported by a Neuron extension component (e.g., PyTorch NeuronX), the Neuron extension component for the latest supported ML framework version will become the default for installation. If users wish to use a Neuron extension component for an earlier supported ML framework version, they will need to explicitly specify the desired version during installation. After upgrading a Neuron extension component to support a newer version of an ML framework, AWS will continue to provide :ref:`regular updates ` for the Neuron extension component that supports the earlier ML framework version for a minimum of 6 months. After the 6-month period, the Neuron extension component for the earlier supported ML framework version may transition into maintenance mode. In maintenance mode, updates for the older Neuron extension component versions will be provided on an as-needed basis, focusing on critical bug fixes and security patches. For more information, see: :ref:`Neuron extension component versioning ` .. _neuron-communication: Communication methods ~~~~~~~~~~~~~~~~~~~~~ Neuron software classification and lifecycle announcements are communicated as follows: - Neuron SDK documentation under `Announcements `__ To see the list of available Neuron SDK versions and supported dependency software components versions: - Neuron SDK documentation under `Release Content `__ - Neuron SDK documentation under `What’s New `__ .. _neuron-software-versioning: Neuron Software Versioning -------------------------- Neuron SDK Documentation Versioning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Neuron SDK documentation is versioned and maps to the corresponding Neuron SDK version. Users can switch to earlier versions of the Neuron SDK documentation by selecting the version from the dropdown in the bottom-left portion of the sidebar. Neuron SDK Versioning ~~~~~~~~~~~~~~~~~~~~~ The Neuron SDK release versions are in the form ``[A.B.C]``, where ``(A)`` represents the major version, ``(B)`` represents the minor version, and ``(C)`` represents the patch version. ..
_neuron-extension-components-versioning: Neuron Extension Components Versioning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Neuron extension components versioning (like PyTorch NeuronX) is in the form ``[X.Y.Z].[A.B.C]``, where ``[X.Y.Z]`` represents the third party component’s major (``X``), minor (``Y``), and patch (``Z``) versions, and ``[A.B.C]`` represents the Neuron extension component's major (``A``), minor (``B``), and patch (``C``) versions. Neuron Standalone Component Versioning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Neuron Component versioning (except for Neuron extension components like PyTorch NeuronX) is in the form ``[A.B.C.D]``, where ``A`` represents the major version, ``B`` represents the minor version, and ``C.D`` represents the patch version. .. _neuron-releases-types: Neuron Software Release Types ----------------------------- Major release ~~~~~~~~~~~~~~~~~ Increasing the major version indicates that the Neuron software underwent significant and substantial changes in an incompatible manner. Applications need to be updated in order for them to work with the newest SDK version. It is important to update major versions carefully and in accordance with the upgrade guidelines provided by AWS. After increasing the major version, the Neuron software may not maintain compatibility with previously supported versions of :ref:`Neuron Runtime `, :ref:`Neuron Compiler `, and :ref:`NEFF `. Minor release ~~~~~~~~~~~~~~~~~ Increasing the minor version indicates that the Neuron software added functionality in a backwards compatible manner. Patch release ~~~~~~~~~~~~~~~~~ Increasing the patch version indicates that the Neuron software added backward compatible bug or security fixes. A bug fix is defined as an internal change that fixes incorrect behavior. Pre-releases ~~~~~~~~~~~~~~~~ - **Developer Preview (Beta)**: During this phase, the Neuron software is not supported, should not be used in production environments, and is meant for early access and feedback purposes only. It is possible for future releases to introduce breaking changes. In the case of a Developer Preview (Beta) release, the minor version will include a lower case ``b`` along with a (Beta) tag. - **Release Candidate (RC)**: Once Neuron identifies a release to be a stable product, it may mark it as a Release Candidate. Release Candidates are ready for GA release unless significant bugs emerge, and will receive full AWS Neuron support. In the case of an RC release, the minor version will include a lower case ``rc`` along with a (RC) tag. .. _sdk-classification: Neuron Software Classification ------------------------------ This section explains the Neuron software classification for APIs, libraries, packages, features, and Neuron supported model classes mentioned in the Neuron documentation.
.. _neuron-releases-types:

Neuron Software Release Types
-----------------------------

Major release
~~~~~~~~~~~~~

Increasing the major version indicates that the Neuron software underwent significant and substantial changes in an incompatible manner. Applications must be updated to work with the newest SDK version. It is important to update major versions carefully and in accordance with the upgrade guidelines provided by AWS. After increasing the major version, the Neuron software may not maintain compatibility with previously supported versions of :ref:`Neuron Runtime `, :ref:`Neuron Compiler `, and :ref:`NEFF `.

Minor release
~~~~~~~~~~~~~

Increasing the minor version indicates that the Neuron software added functionality in a backwards-compatible manner.

Patch release
~~~~~~~~~~~~~

Increasing the patch version indicates that the Neuron software added backwards-compatible bug or security fixes. A bug fix is defined as an internal change that fixes incorrect behavior.

Pre-releases
~~~~~~~~~~~~

- **Developer Preview (Beta)**: During this phase, the Neuron software is not supported, should not be used in production environments, and is meant for early access and feedback purposes only. It is possible for future releases to introduce breaking changes. In the case of a Developer Preview (Beta) release, the minor version will include a lower case ``b`` along with a (Beta) tag.
- **Release Candidate (RC)**: Once Neuron identifies a release as a stable product, it may mark it as a Release Candidate. Release Candidates are ready for GA release unless significant bugs emerge, and they receive full AWS Neuron support. In the case of an RC release, the minor version will include a lower case ``rc`` along with an (RC) tag.

.. _sdk-classification:

Neuron Software Classification
------------------------------

This section explains the Neuron software classification for APIs, libraries, packages, features, and Neuron supported model classes mentioned in the Neuron documentation.

Neuron SDK and Neuron components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-----------------+-----------------+------------------+-------------+
|                 | Testing         | Features         | Performance |
+=================+=================+==================+=============+
| Developer       | Basic           | Minimum Viable   |             |
| Preview (Beta)  |                 | Product (MVP) \* |             |
+-----------------+-----------------+------------------+-------------+
| Release         | Basic           | Minimum Viable   | Tested      |
| Candidate (RC)  |                 | Product (MVP) \* |             |
+-----------------+-----------------+------------------+-------------+
| GA (Stable)     | Standard        | Incremental      | Tested      |
|                 | Product Testing | additions or     |             |
|                 |                 | changes in new   |             |
|                 |                 | releases         |             |
+-----------------+-----------------+------------------+-------------+

\* A minimum viable product (MVP) for a Neuron component contains just enough features to be usable by early customers who can then provide feedback for future development. The MVP can differ per use case and depends on the specific package/library of interest. Please note that in many cases, an MVP can also represent an advanced level of features.

.. _neuron-apis-classification:

Neuron APIs
~~~~~~~~~~~

+----------------------+----------------------+----------------------+
|                      | API Contract         | API Backward         |
|                      |                      | Compatibility        |
+======================+======================+======================+
| Alpha                | Unstable and         | No                   |
|                      | undocumented         |                      |
+----------------------+----------------------+----------------------+
| Developer Preview    | Major changes may    | No                   |
| (Beta)               | happen               |                      |
+----------------------+----------------------+----------------------+
| GA (Stable)          | Incremental changes  | Yes \*               |
|                      | in new releases      |                      |
|                      | (without breaking    |                      |
|                      | the API contract)    |                      |
+----------------------+----------------------+----------------------+

\* In certain cases, when necessary, AWS may introduce API changes that break compatibility, with notice provided ahead of time.

.. _neuron-features-classification:

Neuron Features
~~~~~~~~~~~~~~~

+-----------------+-----------------+------------------------+-------------+
|                 | Testing         | Functionality          | Performance |
+=================+=================+========================+=============+
| Alpha           | No formal       | Partial functionality  | Not tested  |
|                 | testing done    | with a limited set of  | or          |
|                 |                 | core capabilities, far | evaluated   |
|                 |                 | from Minimum Viable    |             |
|                 |                 | Product (MVP) \*       |             |
+-----------------+-----------------+------------------------+-------------+
| Developer       | Basic           | Minimum Viable         |             |
| Preview (Beta)  |                 | Product (MVP) \*       |             |
+-----------------+-----------------+------------------------+-------------+
| GA (Stable)     | Standard        | Incremental additions  | Tested      |
|                 | Product Testing | or changes in new      |             |
|                 |                 | releases               |             |
+-----------------+-----------------+------------------------+-------------+

\* A minimum viable product (MVP) for a Neuron feature contains just enough functionality to be usable by early customers who can then provide feedback for future development. The MVP can differ per use case and depends on the specific feature of interest. Please note that in many cases, an MVP can also represent an advanced level of functionality.

.. _neuron-models-classification:

Neuron Supported Model Classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+----------------------+----------------------+----------------------+
|                      | Accuracy /           | Throughput / Latency |
|                      | Convergence          |                      |
+======================+======================+======================+
| Developer Preview    | Validated            | Tested               |
| (Beta)               |                      |                      |
+----------------------+----------------------+----------------------+
| GA (Stable)          | Validated            | Tested               |
+----------------------+----------------------+----------------------+

================================================
FILE: about-neuron/security.rst
================================================

.. meta::
   :description: Security disclosures and notification for the AWS Neuron SDK.
   :date-modified: 01/27/2026

.. _security:

Neuron Security Disclosures
===========================

If you think you've found a potential security issue, please do not post it in the Issues. Instead, please follow the instructions at https://aws.amazon.com/security/vulnerability-reporting/ or email AWS Security directly at `aws-security@amazon.com <mailto:aws-security@amazon.com>`__.

Important Security Information for Trainium Hardware
----------------------------------------------------

Trainium hardware is designed to optimize performance for machine learning workloads. To deliver high performance, applications with access to Trainium devices have unrestricted access to instance physical memory.

What this means for your deployment:

* Instance-level isolation is maintained: AWS EC2 ensures Trainium devices cannot access the physical memory of other EC2 instances.
* As a best practice to prevent unrestricted access to host physical memory by any user/application, we recommend implementing a permission model where:

  * A dedicated system group owns the device nodes
  * Only explicitly authorized users are added to this group
  * Device permissions prevent access by users outside the group

Customer responsibility: Ensure that only trusted applications have access to Trainium devices on Trainium instances. For more information, see `the AWS Shared Responsibility Model <https://aws.amazon.com/compliance/shared-responsibility-model/>`__.

Example Implementation Steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The steps below are an example you can follow to implement a security group using udev rules:

1. Create a dedicated security group (in this example, ``neuron``): ``sudo groupadd -r neuron``
2. Add authorized users to that security group: ``sudo usermod -aG neuron {username-to-add-here}``; repeat for each user.
3. Configure udev rules. Create a udev rule to automatically set correct ownership and permissions when Trainium (neuron) devices are detected. Create the file ``/etc/udev/rules.d/neuron-udev.rules`` with the following content:

   .. code-block:: shell

      # Neuron device access control
      # Only members of the 'neuron' group can access 'neuron' devices.
      SUBSYSTEM=="neuron*", KERNEL=="neuron*", GROUP="neuron", MODE="0660"

4. Apply the configuration: ``sudo udevadm control --reload`` followed by ``sudo udevadm trigger --subsystem-match=neuron``
5. Verify the configuration: ``ls -l /dev/neuron*``

   Expected output: ``crw-rw---- 1 root neuron 239, 0 Jan 9 15:58 /dev/neuron0``

================================================
FILE: about-neuron/troubleshooting.rst
================================================

.. _general-troubleshooting:

Troubleshooting Guide
=====================

.. contents:: Table of contents
   :local:
   :depth: 1

Training Only Troubleshooting
-----------------------------

* :ref:`PyTorch Neuron for Training `

Inference Only Troubleshooting
------------------------------

* :ref:`PyTorch Neuron for Inference `
* :ref:`NeuronPerf `
* :ref:`MXNet Neuron `

Runtime Troubleshooting
-----------------------

* :ref:`Neuron Runtime Troubleshooting on Inf1 and Trn1 `

Containers Troubleshooting
--------------------------

* :ref:`Containers `

Setup Troubleshooting
---------------------

* :ref:`neuron-setup-troubleshooting`

================================================
FILE: about-neuron/what-is-neuron.rst
================================================

.. _what-is-neuron:

.. meta::
   :description: AWS Neuron is a software development kit for high-performance machine learning on AWS Inferentia and Trainium, enabling developers to compile, optimize, and deploy deep learning models at scale.

What is AWS Neuron?
===================

AWS Neuron is the software stack for running deep learning and generative AI workloads on AWS Trainium and AWS Inferentia. Built on an open source foundation, Neuron enables developers to build, deploy, and explore natively with the PyTorch and JAX frameworks, and with ML libraries such as Hugging Face, vLLM, PyTorch Lightning, and others, without modifying their code. It includes a compiler, runtime, training and inference libraries, and developer tools for monitoring, profiling, and debugging.

Neuron supports your end-to-end machine learning (ML) development lifecycle, from building and deploying deep learning and AI models, to optimizing them for the highest performance and lowest cost, to getting deeper insights into model behavior. Neuron enables rapid experimentation, production-scale training of frontier models, low-level performance optimization through the Neuron Kernel Interface (NKI) for custom kernels, cost-optimized inference deployment for agentic AI and reinforcement learning workloads, and comprehensive profiling and debugging with Neuron Explorer. For more details, see the detailed documentation under :ref:`About the AWS Neuron SDK `.

Who is AWS Neuron for?
----------------------

* **ML engineers** can use Neuron's vLLM integration to migrate their models to Trainium for improved performance without code modifications.
* **Performance engineers** can use NKI and our Developer Tools to create new ML kernels and optimize existing ones.
* **ML researchers** can use their existing PyTorch experience and ecosystem tools to experiment freely on Trainium using our native PyTorch implementation, without having to learn new frameworks or APIs.

What is AWS Neuron used for?
----------------------------

**Research and Development**: Neuron provides native PyTorch execution on Trainium with full Eager mode compatibility. The stack supports standard distributed training patterns including FSDP, DDP, and DTensor for model sharding across devices and nodes. ``torch.compile`` integration enables graph optimization, while existing frameworks like TorchTitan and HuggingFace Transformers run without code modifications. JAX support includes XLA compilation targeting Inferentia and Trainium hardware.

**Production Inference**: Neuron implements vLLM V1 API compatibility on Trainium and Inferentia with optimizations for large-scale inference workloads. The runtime supports Expert Parallelism for MoE models, disaggregated inference architectures, and speculative decoding. Optimized kernels from the NKI Library provide hardware-specific implementations. Training workflows integrate with HuggingFace Optimum Neuron, PyTorch Lightning, and TorchTitan, with seamless deployment through standard vLLM interfaces.

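As a minimal sketch of what the vLLM V1 API compatibility above means in practice, the standard vLLM entry points work unchanged (the model name is illustrative, and this assumes a Trainium or Inferentia instance with the Neuron vLLM integration installed):

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Standard vLLM APIs; no Neuron-specific code is required.
   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # illustrative model
   params = SamplingParams(temperature=0.8, max_tokens=64)
   outputs = llm.generate(["What is AWS Trainium?"], params)
   print(outputs[0].outputs[0].text)
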
**Performance Engineering**: Neuron Kernel Interface (NKI) provides direct access to the Trainium instruction set architecture with APIs for memory management, execution scheduling, and low-level kernel development. The NKI Compiler, built on MLIR, offers full visibility into the compilation pipeline from high-level operations to hardware instructions. The NKI Library contains optimized kernel implementations with source code and performance benchmarks. Neuron Explorer enables comprehensive profiling from application code to hardware execution, supporting both single-node and distributed workload analysis with detailed performance metrics and optimization recommendations.

AWS Neuron Core Components
--------------------------

**vLLM**

Neuron enables production inference deployment with standard frameworks and APIs on Trainium and Inferentia. Use Neuron's vLLM integration with standard APIs to deliver high-performance model serving with optimized kernels from the NKI Library. It provides:

* **Standard vLLM APIs**: Full compatibility with vLLM V1 APIs, enabling customers to use familiar vLLM interfaces on Neuron hardware without code changes
* **Advanced Inference Features**: Support for Expert Parallelism for MoE models, disaggregated inference for flexible deployment architectures, and speculative decoding for improved latency
* **Optimized Performance**: Pre-optimized kernels from the NKI Library for peak performance across dense, MoE, and multimodal models
* **Open Source**: Source code released on GitHub under the vLLM project organization, enabling community contributions

**Native PyTorch**

Neuron provides native integration with PyTorch, enabling researchers and ML developers to run existing code unchanged on Trainium. Train models with familiar workflows and tools, from pre-training to post-training with reinforcement learning, while leveraging Trainium's performance and cost advantages for both experimentation and production-scale training. It provides:

* **Native Device Support**: Neuron registers as a native device type in PyTorch with standard device APIs like ``torch.tensor([1,2,3], device='neuron')`` and ``.to('neuron')``
* **Standard Distributed Training APIs**: Support for FSDP, DTensor, DDP, tensor parallelism, context parallelism, and distributed checkpointing
* **Eager Mode Execution**: Immediate operation execution for interactive development and debugging in notebook environments
* **torch.compile Integration**: Support for ``torch.compile`` for optimized performance
* **Open Source**: Released as an open source package on GitHub under Apache 2.0, enabling community contributions.

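For illustration, the following minimal sketch exercises the native device APIs listed above (it assumes a Trainium instance with the native PyTorch backend installed; the tensor values are arbitrary):

.. code-block:: python

   import torch

   # Neuron registers as a native PyTorch device type, so the standard
   # device APIs work unchanged.
   x = torch.tensor([1, 2, 3], device='neuron')  # allocate directly on Trainium
   y = torch.ones(3).to('neuron')                # move an existing tensor
   z = (x + y).cpu()                             # compute on device, copy back to host
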
**Neuron Kernel Interface (NKI)**

For performance engineers seeking maximum hardware efficiency, Neuron provides complete control through the Neuron Kernel Interface (NKI), with direct access to the NeuronISA (NISA) instruction set, memory allocation, and execution scheduling. Developers can create new operations not available in standard frameworks and optimize performance-critical code with custom kernels. It includes:

* The NKI Compiler, built on MLIR, which provides greater transparency into the kernel compilation process
* The NKI Library, which provides pre-built kernels you can use to optimize the performance of your models

**Neuron Tools**

Debug and profiling utilities including:

* Neuron Monitor for real-time performance monitoring
* Neuron Explorer, built on the Neuron Profiler (``neuron-profile``), for detailed performance analysis

Neuron Explorer provides:

* **Hierarchical Profiling**: Top-down visualization from framework layers through HLO operators to hardware instructions, enabling developers to understand execution at any level of the stack
* **Code Linking**: Direct navigation between PyTorch, JAX, and NKI source code and the performance timeline, with automatic annotations showing metrics for specific code lines
* **IDE Integration**: VSCode extension for profile visualization and analysis directly within the development environment
* **Device Profiling**: Unified interface for a comprehensive view of system-wide metrics and device-specific execution details

**Neuron Compiler**

Optimizes machine learning models for AWS Inferentia and Trainium chips, converting models from popular frameworks into efficient executable formats.

**Neuron Runtime**

Manages model execution on Neuron devices, handling memory allocation, scheduling, and inter-chip communication for maximum throughput.

**AWS DLAMIs and DLCs**

Orchestrate and deploy your models using AWS Deep Learning AMIs (DLAMIs) and Deep Learning Containers (DLCs). Neuron DLAMIs come pre-configured with the Neuron SDK, popular frameworks, and helpful libraries, allowing you to quickly begin training and running inference on AWS Inferentia. Or, quickly deploy models using pre-configured AWS Neuron Deep Learning Containers (Neuron DLCs) with optimized frameworks for AWS Trainium and Inferentia.

Supported Hardware
------------------

**AWS Inferentia**

Purpose-built for high-performance inference workloads:

* ``Inf1`` instances - First-generation Inferentia chips
* ``Inf2`` instances - Second-generation with improved performance and efficiency

**AWS Trainium**

Designed for distributed training of large models:

* ``Trn1`` instances - High-performance training acceleration
* ``Trn1n`` instances - Enhanced networking for large-scale distributed training
* ``Trn2`` instances - Next-generation Trainium with superior performance
* ``Trn2`` UltraServer - High-density Trainium servers for massive training workloads
* ``Trn3`` UltraServer - The next generation of Trainium servers for massive training workloads

How do I get more information?
------------------------------

* Review the comprehensive documentation and follow the tutorials on this site
* Check the Neuron GitHub repositories for code examples. GitHub repos include:

  * `Neuron SDK code samples `_
  * `Neuron NKI ML kernel samples `_
  * `Neuron container configurations `_
  * `Helm charts for Kubernetes deployment `_
  * `NeuronX Distributed Core library sources `_
  * `NeuronX Distributed Training library sources `_
  * `NeuronX Distributed Inference library sources `_
  * `Linux kernel driver sources `_
  * `Neuron workshop model samples `_

* Visit the `AWS Neuron support forum `_ for community assistance

================================================
FILE: about-neuron/whats-new.rst
================================================

.. _main_whats-new:

.. meta::
   :description: Blog posts for the latest features and updates for the AWS Neuron SDK
   :date-modified: 03/13/2026

What's New in the AWS Neuron SDK
================================

.. toctree::
   :hidden:
   :maxdepth: 1

   Release Notes

*Explore detailed posts about the latest releases, updates, and upcoming changes to the AWS Neuron SDK.*

.. grid:: 1
   :gutter: 2

   .. grid-item-card:: Neuron Release Notes
      :link: /release-notes/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      **Latest release**: 2.29.0 (04/09/2026)

----

.. _whats-new-2026-04-02-v2_29:

AWS Neuron SDK 2.29.0: NKI Exits Beta, CPU Simulator, and Expanded NKI Library
-------------------------------------------------------------------------------

**Posted on**: April 09, 2026

Today we are releasing AWS Neuron SDK 2.29.0. This release brings NKI 0.3.0 out of Beta into Stable, featuring the new NKI Standard Library and an experimental CPU Simulator for local kernel development without Trainium hardware. The NKI Library adds 7 new experimental kernels, including Conv1D, a Transformer TKG megakernel, and fused communication-compute primitives, along with improvements to existing attention, MLP, and MoE kernels. NxD Inference delivers performance gains for Qwen2 VL, Qwen3 VL, and Flux.1 models. Neuron Runtime introduces new APIs for collective stream management and network proxy tuning. Neuron Explorer has moved out of Beta and is now Stable, with full Device widget support in the System Trace Viewer and availability on the VS Code Extension Marketplace. The Neuron Driver adds support for new Trn3 Gen2 UltraServer configurations.

Neuron Kernel Interface (NKI)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AWS Neuron SDK 2.29.0 introduces NKI 0.3.0, the latest update to the Neuron Kernel Interface. NKI 0.3.0 has moved out of Beta and is now Stable. It features the NKI Standard Library (``nki-stdlib``), which provides developer-visible code for all NKI APIs and native language objects (such as ``NkiTensor``). This release exposes newly available Trainium capabilities and features in the NKI API and introduces ``nki.language`` APIs.

**NKI CPU Simulator (Experimental)**: NKI 0.3.0 includes a CPU Simulator, which executes NKI kernels entirely on CPU and allows for a fast development cycle on inexpensive CPUs and compute instances to validate kernel correctness, using standard Python step-by-step debugging tools and instrumentation to print results for every line of kernel code. Activate it with ``NKI_SIMULATOR=1`` or use ``nki.simulate(kernel)``.

**New Language APIs (Experimental)**: Introduced ``nki.language`` high-level convenience wrappers including ``nl.load``, ``nl.store``, ``nl.copy``, ``nl.matmul``, ``nl.transpose``, and ``nl.softmax``.

**New ISA and Hardware Features**: Added the ability to set the DMA priority of DMA operations and collectives operations for Trn3 (NeuronCore-v4). A dedicated ``nki.isa.exponential`` instruction is optimized for vectorizing exponents (``exp``) with VectorE. Matmul accumulation control is added via the ``accumulate`` parameter on ``nc_matmul`` and ``nc_matmul_mx``. Variable-length all-to-all collectives are now available via ``nki.collectives.all_to_all_v``.

**Breaking Changes**: NKI 0.3.0 includes several API breaking changes that improve correctness and consistency. All kernels must be updated to NKI 0.3.0; mixing with Beta 2 kernels in the same model is not supported. For the full list of changes and migration examples, see the :doc:`NKI 0.3.0 Update Guide `.

For more details, see :ref:`nki-2-29-0-rn`.

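As a rough sketch of the simulator workflow described above (the kernel body and exact API signatures here are illustrative assumptions; consult the NKI documentation for the authoritative interface):

.. code-block:: python

   import nki
   import nki.language as nl

   @nki.jit
   def copy_kernel(a):
       # Illustrative kernel: load a tile from HBM and store it back out.
       out = nl.ndarray(a.shape, dtype=a.dtype, buffer=nl.shared_hbm)
       nl.store(out, nl.load(a))
       return out

   # Run the kernel on CPU instead of Trainium hardware; alternatively,
   # set NKI_SIMULATOR=1 in the environment to simulate an unmodified script.
   simulated_kernel = nki.simulate(copy_kernel)
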
NKI Library
^^^^^^^^^^^

**New Experimental Kernels (7 added)**: Conv1D provides 1D convolution with stride, padding, dilation, bias, activation fusion, and LNC sharding. Transformer TKG is a multi-layer transformer forward-pass megakernel for token generation. Fine-Grained All-Gather and FGCC (All-Gather + Matmul) enable ring-based communication with compute overlap on Trn2. SBUF-to-SBUF All-Gather provides two variants for small and large tensors. Top-K Reduce supports MoE output gathering with LNC sharding. Dynamic Elementwise Add handles runtime-variable M-dimension tiling. The ``find_nonzero_indices`` subkernel is promoted from experimental to core.

**Key Improvements to Existing Kernels**: Attention CTE increases the max batch size from 32 to 512 and the max sequence length from 36,864 to 131,072, with sequence packing support. Attention Block TKG adds fused QK-norm before RoPE and KVDP attention sharding. MLP adds BufferManager support and MXFP4/MXFP8 quantization paths. MoE TKG introduces a dynamic all-expert algorithm with ``block_size``. QKV adds flexible weight layout support. PyTorch reference implementations are added for 22 kernels.

**Breaking Changes**: Multiple kernel signatures have changed, with new parameters inserted mid-signature; callers using positional arguments must switch to keyword arguments. ``SbufManager`` is renamed to ``BufferManager``. MoE TKG replaces boolean sharding flags with the ``LNCShardingStrategy`` enum. For the full list of breaking changes, see :ref:`nki-lib-2-29-0-rn`.

For more details, see :ref:`nki-lib-2-29-0-rn`.

Inference Updates
^^^^^^^^^^^^^^^^^

**NxD Inference 0.9.17155**: Qwen2 VL gains vision data parallelism with a 7% QPS improvement for image-heavy workloads. Qwen3 VL adds text-model sequence parallelism with a 2.2x QPS throughput improvement. Flux.1 adds CFG parallelism with a 19% end-to-end latency improvement and a 23% instance throughput improvement.

**vLLM Neuron Plugin 0.5.0**: Updated alongside NxD Inference with model performance improvements.

**Hardware Support Change**: NxD Inference no longer supports Trn1/Inf2. Only Trn2 and newer hardware is supported. Pin to Neuron SDK 2.28 for Trn1/Inf2 support.

For more details, see :ref:`nxd-inference-2-29-0-rn`.

Runtime and Driver
^^^^^^^^^^^^^^^^^^

**Neuron Runtime Library 2.31**: The new ``nrt_cc_create_stream`` API creates a collective stream to be used by host-initiated collectives, replacing the previous environment-variable approach. The new ``nrt_get_attached_efa_bdf`` API returns the BDF string of the EFA device for optimal network interface selection. New environment variables ``NEURON_RT_ONE_THREAD_PER_CORE`` (up to a 2x improvement in collective communication latency) and ``NEURON_RT_RANKS_PER_NETWORK_PROXY`` provide fine-grained control over network proxy threading. RDMA support extends to Trn3. The Collectives XU gains profiling support, context caching with up to a 90% performance improvement, and removal of the 512 queue set instance limit. The async API version is bumped from 2.x to 3.0; applications using the async API must be recompiled.

**Neuron Driver 2.27**: Adds support for new Trn3 Gen2 UltraServer configurations: US3 (2-node), US4 (4-node), US16 (4-node), and US18 (4-node). Top-level DMA reset support is added during TPB reset on Trn3 and later platforms.

**Neuron Collectives 2.31**: EFA device processing is restructured to per-stream granularity for improved stability. Fixed incorrect interface selection in multi-UltraServer collectives and a crash on channel initialization failures.

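A minimal sketch of setting the ``NEURON_RT_*`` threading controls introduced above from Python, before the runtime is initialized (the values shown are illustrative, not recommendations):

.. code-block:: python

   import os

   # Dedicate one runtime thread per core (up to 2x lower collective latency).
   os.environ["NEURON_RT_ONE_THREAD_PER_CORE"] = "1"
   # Tune how many ranks each network proxy thread serves (illustrative value).
   os.environ["NEURON_RT_RANKS_PER_NETWORK_PROXY"] = "8"
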
For more details, see :ref:`runtime-2-29-0-rn`.

Neuron Explorer
^^^^^^^^^^^^^^^

Neuron Explorer has moved out of Beta and is now Stable. The System Trace Viewer now supports the full suite of Device widgets, enabling multi-device profile analysis across all linked Device Profiles within a single System Profile. The Summary Viewer includes system-level profile data for both system and device profiles. The new System Timeline HBM Usage view shows device HBM usage with a memory allocation breakdown by category. Box Selection Summary enables viewing aggregated device profile information for a selected region in the trace viewer. Neuron Explorer for VS Code is now available on the Visual Studio Code Extension Marketplace and Open VSX, enabling simpler installation and automatic updates.

For more details, see :ref:`dev-tools-2-29-0-rn`.

PyTorch Framework
^^^^^^^^^^^^^^^^^

PyTorch 2.7 and 2.8 have reached end of support starting with this release. Use PyTorch 2.9 on Ubuntu 24.04. Starting with PyTorch 2.10 support (planned for a future Neuron release), AWS Neuron will transition from PyTorch/XLA to native PyTorch support via TorchNeuron.

For more details, see :ref:`pytorch-2-29-0-rn`.

End of Support and Migration Notices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Effective this release:**

* PyTorch 2.7 and 2.8 have reached end of support. Pin to Neuron SDK 2.28 if required.
* NeuronX Distributed Training (NxDT) and NxD Core training APIs reach end of support; DLCs and DLAMI virtual environments are pinned to SDK 2.28.0.
* The ``neuron-profile analyze`` subcommand is no longer supported. Migrate to Neuron Explorer.
* The Ubuntu 22.04 Multi-Framework DLAMI is no longer published. Use Ubuntu 24.04.

**Hardware support:**

* NxD Inference no longer supports Trn1/Inf2. Pin to Neuron SDK 2.28 for continued support.

**NKI namespace migration:**

* Removal of the ``neuronxcc.nki.*`` namespace is postponed to a future release. Both the ``neuronxcc.nki.*`` and ``nki.*`` namespaces continue to work. Migration to ``nki.*`` is encouraged.

**Effective with PyTorch 2.10 support:**

* PyTorch/XLA will be replaced by TorchNeuron.

* Read the :doc:`Neuron 2.29.0 component release notes ` for specific Neuron component improvements and details.

----

.. _whats-new-2026-03-13-v2_28_1:

AWS Neuron SDK 2.28.1 Patch Available
--------------------------------------

**Posted on**: March 13, 2026

AWS Neuron provides a patch version, 2.28.1, to address a Neuron Driver compatibility issue with Linux kernel 6.18.

.. _whats-new-2026-02-26-v2_28:

AWS Neuron SDK 2.28.0: Enhanced Profiling, Vision Language Models, and Expanded NKI Capabilities
--------------------------------------------------------------------------------------------------

**Posted on**: February 26, 2026

Today we are releasing AWS Neuron SDK 2.28.0. This release enhances Neuron Explorer with system profiling, a Tensor Viewer, and a Database Viewer for comprehensive performance analysis. NxD Inference adds support for Qwen2/Qwen3 VL vision language models, Flux.1 inpainting capabilities, and Eagle3 speculative decoding. The NKI Library expands with 9 new kernels, including RoPE, MoE operations, and experimental kernels for attention and cross entropy. NKI (Beta 2) introduces LNC multi-core support with intra-LNC collectives and new APIs. Kubernetes users gain Neuron DRA Driver support for advanced resource allocation.

Developer Tools and Profiling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Neuron Explorer Enhancements** - Added system profiling support with drill-down navigation to device profiles.
The new Tensor Viewer helps identify memory bottlenecks by displaying tensor names, shapes, sizes, and memory usage. The Database Viewer provides an interactive interface for querying profiling data using SQL or natural language. Profile Manager now supports tag-based organization and search. A migration guide from Neuron Profiler/Profiler 2.0 is now available.

**nccom-test Improvements** - Enhanced data integrity checks use pseudo-random data patterns for better corruption detection. Added support for the ``alltoallv`` collective operation for benchmarking variable-sized all-to-all communication patterns.

For more details, see :ref:`dev-tools-2-28-0-rn`.

Inference Updates
^^^^^^^^^^^^^^^^^

**NxD Inference 0.8.16251** - Added support for vision language models, including Qwen2 VL (Qwen2-VL-7B-Instruct) and Qwen3 VL (Qwen3-VL-8B-Thinking), for processing text and image inputs (Beta). Pixtral model support is improved with batch size 32 and sequence length 10240 on Trn2 with vLLM V1. The Flux.1 model gains new functionality for in-paint, out-paint, canny edge detection, and depth-based image generation (Beta).

**vLLM Neuron Plugin 0.4.1** - Multi-LoRA serving enhancements enable streaming LoRA adapters via vLLM's ``load_adapter`` API with dynamic runtime loading. Users can now run the base model alone when multi-LoRA serving is enabled. Added Eagle3 speculative decoding support for Llama 3.1 8B. Updated to support vLLM v0.13.0 and PyTorch 2.9.

For more details, see :ref:`nxd-inference-2-28-0-rn`.

NKI Library
^^^^^^^^^^^

**9 New Kernels** - The NKI Library expands from 7 to 16 documented kernel APIs. New core kernels include RoPE (Rotary Position Embedding), Router Top-K (expert selection for MoE), MoE CTE (Context Encoding), MoE TKG (Token Generation), and Cumsum. New experimental kernels include Attention Block TKG (fused attention for token generation), Cross Entropy (forward and backward passes), Depthwise Conv1D, and Blockwise MM Backward (for MoE training).

**Enhanced Quantization Support** - Existing kernels receive FP8 and MX quantization support across the QKV, MLP, and Output Projection kernels. The QKV kernel adds fused FP8 KV cache quantization and a block-based KV cache layout. The MLP kernel adds gate/up projection clamping and fp16 support for TKG mode. The Attention CTE kernel adds strided Q slicing for context parallelism.

**Improved Utilities** - TensorView gains a ``rearrange`` method for dimension reordering and ``has_dynamic_access`` for runtime-dependent addressing checks. SbufManager provides hierarchical tree-formatted allocation logging with new query methods for SBUF utilization. New utilities include ``rmsnorm_mx_quantize_tkg``, ``interleave_copy``, ``LncSubscriptable``, and ``TreeLogger``.

For more details, see :ref:`nki-lib-2-28-0-rn`.

Neuron Kernel Interface (NKI)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**NKI Beta 2 (0.2.0)** - This release includes LNC multi-core support for LNC=2, enabling kernels to leverage multiple NeuronCores within a logical NeuronCore. The compiler now tracks ``shared_hbm`` tensors and canonicalizes LNC kernel outputs. Users can declare tensors private to a single NeuronCore using the ``private_hbm`` memory type.

**New nki.collectives Module** - Enables collective communication across multiple NeuronCores with operations including ``all_reduce``, ``all_gather``, ``reduce_scatter``, ``all_to_all``, ``collective_permute`` variants, and ``rank_id``.

**New APIs and Features** - New ``nki.isa`` APIs include ``nonzero_with_count`` for sparse computation and ``exponential`` for element-wise operations. The new ``float8_e4m3fn`` dtype supports FP8 workloads. Language features include ``no_reorder`` blocks for instruction ordering control, ``__call__`` special method support, a ``tensor.view`` method for reshaping, and shared constants as string arguments.

**API Improvements** - ``dma_transpose`` now supports indirect addressing, ``dma_copy`` adds the ``unique_indices`` parameter, and ``register_alloc`` accepts optional tensor arguments for pre-filling. The compiler no longer truncates diagnostic output.

For more details, see :ref:`nki-2-28-0-rn`.

Kubernetes Support
^^^^^^^^^^^^^^^^^^

**Neuron DRA Driver** - Introduced the Neuron Dynamic Resource Allocation (DRA) Driver, enabling advanced resource allocation using the Kubernetes DRA API for flexible and efficient Neuron device management. The DRA API provides topology-aware scheduling, atomic resource allocation, and per-workload configuration. Neuron Helm Charts now include DRA Driver support.

For more details, see :ref:`containers-2-28-0-rn`.

PyTorch Framework (torch-neuronx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Transition to Native PyTorch Support** - Starting with PyTorch 2.10 support (planned for a future Neuron release), AWS Neuron will transition from PyTorch/XLA to native PyTorch support via TorchNeuron. PyTorch 2.9 is the last version using PyTorch/XLA. Users will need to update their scripts when upgrading to PyTorch 2.10 or later. See :ref:`native-pytorch-trainium` for migration guidance.

For more details, see :ref:`pytorch-2-28-0-rn`.

* Read the :doc:`Neuron 2.28.1 component release notes ` for specific Neuron component improvements and details.

.. _whats-new-2025-12-19-v2_27:

AWS Neuron SDK 2.27.0: Trainium3 Support, Enhanced NKI, and Unified Profiling with Neuron Explorer
---------------------------------------------------------------------------------------------------

**Posted on**: December 19, 2025

Today we are releasing AWS Neuron SDK 2.27.0. This release adds support for Trainium3 (``Trn3``) instances. The enhanced NKI, with the new NKI Compiler, introduces the ``nki.*`` namespace with updated APIs and language constructs. The NKI Library provides pre-optimized kernels for common model operations, including attention, MLP, and normalization. Neuron Explorer delivers a unified profiling suite with AI-driven optimization recommendations. vLLM V1 integration is now available through the vLLM-Neuron Plugin. Deep Learning Containers and AMIs are updated with vLLM V1, PyTorch 2.9, JAX 0.7, Ubuntu 24.04, and Python 3.12.

In addition to this release, we are introducing new capabilities and features in private beta access (see the Private Beta Access section). We are also announcing our transition to PyTorch native support starting with PyTorch 2.10 in Neuron 2.28, plans to simplify NxDI in upcoming releases, and other important updates.

Neuron Kernel Interface (NKI)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**NKI Compiler** - The new ``nki.*`` namespace replaces the legacy ``neuronxcc.nki.*`` namespace. Top-level kernel functions now require the ``@nki.jit`` annotation. Neuron 2.27 supports both namespaces side by side; the legacy namespace will be removed in Neuron 2.28. A kernel migration guide is available in the documentation.

For more details, see :ref:`neuron-2-27-0-nki`.

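As a rough sketch of the namespace migration described above (import paths are from the release notes; kernel bodies themselves are unchanged):

.. code-block:: python

   # Before (legacy namespace, removed in a later release):
   #   from neuronxcc import nki
   #   import neuronxcc.nki.language as nl

   # After (Neuron 2.27 and later):
   import nki
   import nki.language as nl

   # Kernel bodies are unchanged, but every top-level kernel entry point
   # must now carry the @nki.jit annotation.
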
NKI Library
^^^^^^^^^^^

The NKI Library provides pre-optimized kernels: Attention CTE, Attention TKG, MLP, Output Projection CTE, Output Projection TKG, QKV, and RMSNorm-Quant. Kernels are accessible via the ``nkilib.*`` namespace in neuronx-cc or from the GitHub repository.

For more details, see :ref:`neuron-2-27-0-nkilib`.

Developer Tools
^^^^^^^^^^^^^^^

**Neuron Explorer** - A suite of tools designed to support ML engineers throughout their development journey on AWS Trainium. This release features improved performance and user experience for device profiling, with four core viewers to provide insights into model performance:

* **Hierarchy Viewer**: Visualizes model structure and component interactions
* **AI Recommendation Viewer**: Delivers AI-driven optimization recommendations
* **Source Code Viewer**: Links profiling data directly to source code
* **Summary Viewer**: Displays high-level performance metrics

Neuron Explorer is available through UI, CLI, and VSCode IDE integration. Existing NTFF files are compatible but require reprocessing for new features. New tutorials cover profiling NKI kernels, multi-node training jobs, and vLLM inference workloads. The ``nccom-test`` tool now includes fine-grained collective communication support.

For more details, see :ref:`neuron-2-27-0-tools`.

Inference Updates
^^^^^^^^^^^^^^^^^

**vLLM V1** - The vLLM-Neuron Plugin enables vLLM V1 integration for inference workloads. vLLM V0 support ends in Neuron 2.28.

**NxD Inference** - Model support expands with beta releases of Qwen3 MoE (Qwen3-235B-A22B) for multilingual text and Pixtral (Pixtral-Large-Instruct-2411) for image understanding. Both models use HuggingFace checkpoints and are supported on ``Trn2`` and ``Trn3`` instances.

For more details, see :ref:`neuron-2-27-0-nxd-inference`.

Neuron Graph Compiler
^^^^^^^^^^^^^^^^^^^^^

Default accuracy settings are now optimized for precision. The ``--auto-cast`` flag defaults to ``none`` (previously ``matmul``), and ``--enable-mixed-precision-accumulation`` is enabled by default. FP32 models may see performance impacts; restore the previous behavior with ``--auto-cast=matmul`` and ``--disable-mixed-precision-accumulation``. Python 3.10 or higher is now required.

For more details, see :ref:`neuron-2-27-0-compiler`.

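For example, the previous casting behavior can be restored by passing the flags above through a framework's compiler-arguments pass-through; a minimal sketch using ``torch_neuronx.trace`` (the model and inputs are placeholders):

.. code-block:: python

   import torch
   import torch_neuronx

   model = torch.nn.Linear(4, 4).eval()   # placeholder model
   example = torch.rand(1, 4)             # placeholder input

   # Restore the pre-2.27 default casting behavior.
   traced = torch_neuronx.trace(
       model,
       example,
       compiler_args=["--auto-cast=matmul", "--disable-mixed-precision-accumulation"],
   )
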
Runtime Improvements
^^^^^^^^^^^^^^^^^^^^

**Neuron Runtime Library 2.29** adds support for Trainium3 (``Trn3``) instances and delivers performance improvements for Collectives Engine overhead, NeuronCore branch overhead, NEFF program startup, and all-gather latency.

For more details, see :ref:`neuron-2-27-0-runtime`.

Deep Learning AMIs and Containers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Platform Updates** - All DLCs are updated to Ubuntu 24.04 and Python 3.12. DLAMIs add Ubuntu 24.04 support for base, single-framework, and multi-framework configurations.

**Framework Updates**:

* vLLM V1 single framework DLAMI and multi-framework virtual environments
* PyTorch 2.9 single framework DLAMIs and multi-framework virtual environments (Amazon Linux 2023, Ubuntu 22.04, Ubuntu 24.04)
* JAX 0.7 single framework DLAMI and multi-framework virtual environments

**New Container** - The ``pytorch-inference-vllm-neuronx`` 0.11.0 DLC provides a complete vLLM inference environment with PyTorch 2.8 and all dependencies.

For more details, see :ref:`neuron-2-27-0-dlami` and :ref:`neuron-2-27-0-dlc`.

End of Support and Migration Notices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Effective this release:**

* :ref:`announcement-python-3-9-eol`
* :ref:`announcement-end-of-support-pytorch-2-6`
* :ref:`announce-no-support-tensorflow2-10`
* :ref:`announce-eos-inf1-virtual-environments`
* :ref:`announcement-end-of-support-parallel-model-trace`
* :ref:`announce-eos-tensorboard-tools`

**Effective Neuron 2.28:**

* :ref:`announcement-end-of-support-neuronxcc-nki`
* :ref:`announcement-nki-library-namespace-changes`
* :ref:`announcement-nki-library-kernel-migration`
* :ref:`announcement-end-of-support-vllm-v0`

**Effective with PyTorch 2.10 support:**

* :ref:`announce-transition-pytorch-trainium`
* :ref:`announcement-end-of-support-nxdt-nxd-core`

**Future Releases:**

* :ref:`announce-nxdi-changes`
* :ref:`announce-eos-dlami-ubuntu-22-04`
* :ref:`announce-eos-pytorch-profling-api`
* :ref:`announce-eos-neuron-profiler`

Detailed Release Notes
^^^^^^^^^^^^^^^^^^^^^^^

* Read the :doc:`Neuron 2.27.0 component release notes ` for specific Neuron component improvements and details.

----

.. _whats-new-2025-12-02-riv:

AWS Neuron Expands with Trainium3, Native PyTorch, Faster NKI, and Open Source at re:Invent 2025
------------------------------------------------------------------------------------------------

**Posted on**: December 02, 2025

.. image:: /images/NeuronStandalone_white_small.png
   :alt: AWS Neuron Logo
   :align: right
   :width: 120px

At re:Invent 2025, AWS Neuron introduces support for the `Trainium3 UltraServer `__ with expanded open source components and an enhanced developer experience. These updates enable standard frameworks to run unchanged on Trainium, removing barriers for researchers to experiment and innovate. For developers requiring deeper control, the enhanced Neuron Kernel Interface (NKI) provides direct access to hardware-level optimizations, enabling customers to scale AI workloads with improved performance.

**Expanded capabilities and enhancements include**:

* :doc:`Trainium3 UltraServer support `: Enabling customers to scale AI workloads with improved performance
* :doc:`Native PyTorch support `: Standard PyTorch runs unchanged on Trainium without platform-specific modifications
* :doc:`Enhanced Neuron Kernel Interface (NKI) ` with the open source :doc:`NKI Compiler `: Improved programming capabilities with direct access to Trainium hardware instructions and fine-grained optimization control, with a compiler built on MLIR
* :doc:`NKI Library `: Open source collection of optimized, ready-to-use kernels for common ML operations
* :doc:`Neuron Explorer `: Tools suite to support developers and performance engineers in their performance optimization journey from framework operations to hardware instructions
* :doc:`Neuron DRA for Kubernetes `: Kubernetes-native resource management eliminating custom scheduler extensions
* :doc:`Expanded open source components `: Open sourcing more components, including the NKI Compiler, Native PyTorch, the NKI Library, and more, released under Apache 2.0

AI development requires rapid experimentation, hardware optimization, and production-scale workloads. These updates enable researchers to experiment with novel architectures using familiar workflows, ML developers to build AI applications using standard frameworks, and performance engineers to optimize workloads using low-level hardware optimization.

.. admonition:: Looking to try out our Beta features?

   Submit your beta access request through `this form `__ and the Neuron Product team will get back to you.

Native PyTorch Support
^^^^^^^^^^^^^^^^^^^^^^

**Private Preview**

AWS Neuron now natively supports PyTorch through TorchNeuron, an open source native PyTorch backend for Trainium. TorchNeuron integrates with PyTorch through the PrivateUse1 device backend mechanism, registering Trainium as a native device alongside other backends and allowing researchers and ML developers to run their code without modifications. TorchNeuron provides eager mode execution for interactive development and debugging, native distributed APIs including FSDP and DTensor for distributed training, and ``torch.compile`` support for optimization. TorchNeuron enables compatibility with ecosystem tools like TorchTitan and HuggingFace Transformers with minimal code changes.

Use TorchNeuron to run your PyTorch research and training workloads on Trainium without platform-specific code changes.

**Learn more**: :doc:`documentation ` and the `TorchNeuron GitHub repository `__.

**Access**: Contact your AWS account team for access.

Enhanced NKI
^^^^^^^^^^^^

**Public Preview**

The enhanced Neuron Kernel Interface (NKI) provides developers with complete hardware control through advanced APIs for fine-grained scheduling and allocation. The enhanced NKI enables instruction-level programming, memory allocation control, and execution scheduling with direct access to the Trainium ISA. We are also releasing the NKI Compiler as open source under Apache 2.0, built on MLIR to enable transparency and collaboration with the broader compiler community. NKI integrates with PyTorch and JAX, enabling developers to use custom kernels within their training workflows.

Use the enhanced NKI to innovate and build optimized kernels on Trainium. Explore the NKI Compiler source code to inspect and contribute to the MLIR-based compilation pipeline.

.. note::

   The NKI Compiler source code is currently in **Private Preview**, while the NKI programming interface is in **Public Preview**.

**Learn more**: :doc:`NKI home page ` and :doc:`NKI Language Guide `.

NKI Library
^^^^^^^^^^^

**Public Preview**

The NKI Library provides an open source collection of optimized, ready-to-use kernels for common ML operations. The library includes kernels for dense transformer operations, MoE-specific operations, and attention mechanisms, all with complete source code, documentation, and benchmarks.

Use NKI Library kernels directly in your models to improve performance, or explore the implementations as a reference for performance optimization best practices on Trainium.

**Learn more**: `GitHub repository `__ and :doc:`API documentation `.

Neuron Explorer
^^^^^^^^^^^^^^^

**Public Preview**

Neuron Explorer is a tools suite that supports developers and performance engineers in their performance optimization journey. It provides capabilities to inspect and optimize code from framework operations down to hardware instructions with hierarchical profiling, source code linking, IDE integration, and AI-powered recommendations for optimization insights.

Use Neuron Explorer to understand and optimize your model performance on Trainium, from high-level framework operations to low-level hardware execution.

**Learn more**: :doc:`Neuron Explorer documentation `.

Kubernetes-Native Resource Management with Neuron DRA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Private Preview**

Neuron Dynamic Resource Allocation (DRA) provides Kubernetes-native resource management for Trainium, eliminating custom scheduler extensions.
DRA enables topology-aware scheduling using the default Kubernetes scheduler, atomic UltraServer allocation, and flexible per-workload configuration. Neuron DRA supports EKS, SageMaker HyperPod, and UltraServer configurations. The driver is open source, with container images in the AWS ECR Public Gallery.

Use Neuron DRA to simplify Kubernetes resource management for your Trainium workloads with native scheduling and topology-aware allocation.

**Learn more**: :doc:`Neuron DRA documentation `.

**Access**: Contact your AWS account team to participate in the Private Preview.

Resources and Additional Information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For more information, visit the `AWS Trainium official page `__, the :doc:`AWS Neuron Documentation `, and :doc:`the AWS Neuron GitHub repositories `.

================================================
FILE: archive/helper-tools/index.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

Helper Tools
============

.. toctree::
   :maxdepth: 1

   Check Model 
   GatherInfo 

================================================
FILE: archive/helper-tools/tutorial-neuron-check-model.rst
================================================

.. _neuron_check_model:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

Neuron Check Model
^^^^^^^^^^^^^^^^^^

Overview
========

The Neuron Check Model tool provides users with basic information about a compiled or uncompiled model's operations without the use of TensorBoard-Neuron. For additional visibility into models, please see :ref:`neuron-plugin-tensorboard`.

The Neuron Check Model tool scans the user's uncompiled model and provides a table of the operations within the uncompiled model. By default, the table shows each operation type, the number of instances of that type within the model, and whether the type is supported in Neuron. If the ``--show_names`` option is specified, the table shows each operation by name and whether the type of that operation is supported in Neuron.

If the model is already compiled, the tool also provides the table of operations as for the uncompiled model. The table includes the Neuron subgraph type and the number of instances of that type, along with operations that have not been compiled to Neuron. Additionally, the tool displays a message showing the minimum number of NeuronCores required to run the model, followed by another table which shows the list of Neuron subgraphs by name and the number of pipelined NeuronCores used by each subgraph. More information about NeuronCore pipeline can be found in :ref:`neuroncore-pipeline`. If the ``--expand_subgraph`` option is specified, the operations within each subgraph are printed below the subgraph information.

The Neuron Check Model tool is currently available for TensorFlow and MXNet. To check a PyTorch model, please use the ``torch.neuron.analyze_model`` function as shown in the PyTorch-Neuron Getting Started tutorial :ref:`/src/examples/pytorch/resnet50.ipynb`.

TensorFlow-Neuron Check Model
=============================

The following example shows how to run the TensorFlow-Neuron Check Model tool with the TensorFlow ResNet50 tutorial.

1. Start with the TensorFlow ResNet50 tutorial at :ref:`/src/examples/tensorflow/tensorflow_resnet50/resnet50.ipynb` and do the first three steps of the tutorial.
   Please stay in the Python environment that you set up during the tutorial.

2. Install the needed tensorflow_hub package and download the tool:

   ::

      pip install tensorflow_hub
      wget https://raw.githubusercontent.com/aws/aws-neuron-sdk/master/src/neuron-gatherinfo/tf_neuron_check_model.py
      python tf_neuron_check_model.py -h

   ::

      usage: tf_neuron_check_model.py [-h] [--show_names] [--expand_subgraph] model_path

      positional arguments:
        model_path         a TensorFlow SavedModel directory (currently supporting
                           TensorFlow v1 SaveModel only).

      optional arguments:
        -h, --help         show this help message and exit
        --show_names       list operation by name instead of summarizing by type
                           (caution: this option will generate many lines of output
                           for a large model).
        --expand_subgraph  show subgraph operations.

3. After step 3 of the TensorFlow ResNet50 tutorial, you can check the uncompiled model to see Neuron supported operations (currently supporting TensorFlow v1 SavedModel only):

   ::

      $ python tf_neuron_check_model.py ws_resnet50/resnet50/

      * The following table shows the supported and unsupported operations within this uncompiled model.
      * Each line shows an operation type, the number of instances of that type within model,
      * and whether the type is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['Placeholder', 'PlaceholderWithDefault', 'NoOp', 'Const', 'Identity', 'IdentityN', 'VarHandleOp', 'VarIsInitializedOp', 'AssignVariableOp', 'ReadVariableOp', 'StringJoin', 'ShardedFilename', 'SaveV2', 'MergeV2Checkpoints', 'RestoreV2']

      Op Type             Num Instances   Neuron Supported ?
      -------             -------------   ------------------
      Pad                 2               Yes
      RandomUniform       54              Yes
      Sub                 54              Yes
      Mul                 54              Yes
      Add                 54              Yes
      Conv2D              53              Yes
      BiasAdd             54              Yes
      FusedBatchNormV3    53              Yes
      Relu                49              Yes
      MaxPool             1               Yes
      AddV2               16              Yes
      Fill                56              Yes
      Mean                1               Yes
      MatMul              1               Yes
      Softmax             1               Yes
      Pack                1               Yes

      * Total inference operations: 504
      * Total Neuron supported inference operations: 504
      * Percent of total inference operations supported by Neuron: 100.0

4. You can also check the compiled model to see the number of pipelined NeuronCores for each subgraph:

   ::

      $ python tf_neuron_check_model.py ws_resnet50/resnet50_neuron/

      * Found 1 Neuron subgraph(s) (NeuronOp(s)) in this compiled model.
      * Use this tool on the original uncompiled model to see Neuron supported operations.
      * The following table shows all operations, including Neuron subgraphs.
      * Each line shows an operation type, the number of instances of that type within model,
      * and whether the type is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['Placeholder', 'PlaceholderWithDefault', 'NoOp', 'Const', 'Identity', 'IdentityN', 'VarHandleOp', 'VarIsInitializedOp', 'AssignVariableOp', 'ReadVariableOp', 'StringJoin', 'ShardedFilename', 'SaveV2', 'MergeV2Checkpoints', 'RestoreV2']

      Op Type     Num Instances   Neuron Supported ?
      -------     -------------   ------------------
      NeuronOp    1               Yes

      * Please run this model on Inf1 instance with at least 1 NeuronCore(s).
      * The following list show each Neuron subgraph with number of pipelined NeuronCores used by subgraph
      * (and subgraph operations if --expand_subgraph is used):

      Subgraph Name                                                                  Num Pipelined NeuronCores
      -------------                                                                  -------------------------
      conv5_block3_3_bn/FusedBatchNormV3/ReadVariableOp/neuron_op_d6f098c01c780733   1

5. When showing subgraph information, you can use ``--expand_subgraph`` to show operation types in each subgraph:

   ::

      $ python tf_neuron_check_model.py ws_resnet50/resnet50_neuron/ --expand_subgraph

      (output truncated to show subgraph information only)

      Subgraph Name                                                                  Num Pipelined NeuronCores
      -------------                                                                  -------------------------
      conv5_block3_3_bn/FusedBatchNormV3/ReadVariableOp/neuron_op_d6f098c01c780733   1

        Op Type           Num Instances
        -------           -------------
        MatMul            1
        Relu              49
        Add               16
        FusedBatchNorm    53
        BiasAdd           54
        Conv2D            53
        Pad               2
        Mean              1
        MaxPool           1
        Softmax           1

6. Use ``--show_names`` to see full operation names (caution: this option will generate many lines of output for a large model):

   ::

      $ python tf_neuron_check_model.py ws_resnet50/resnet50_neuron/ --show_names

      * Found 1 Neuron subgraph(s) (NeuronOp(s)) in this compiled model.
      * Use this tool on the original uncompiled model to see Neuron supported operations.
      * The following table shows all operations, including Neuron subgraphs.
      * Each line shows an operation name and whether the type of that operation is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['Placeholder', 'PlaceholderWithDefault', 'NoOp', 'Const', 'Identity', 'IdentityN', 'VarHandleOp', 'VarIsInitializedOp', 'AssignVariableOp', 'ReadVariableOp', 'StringJoin', 'ShardedFilename', 'SaveV2', 'MergeV2Checkpoints', 'RestoreV2']

      Op Name                                                                        Op Type     Neuron Supported ?
      -------                                                                        -------     ------------------
      conv5_block3_3_bn/FusedBatchNormV3/ReadVariableOp/neuron_op_d6f098c01c780733   NeuronOp    Yes

      * Please run this model on Inf1 instance with at least 1 NeuronCore(s).
      * The following list show each Neuron subgraph with number of pipelined NeuronCores used by subgraph
      * (and subgraph operations if --expand_subgraph is used):

      Subgraph Name                                                                  Num Pipelined NeuronCores
      -------------                                                                  -------------------------
      conv5_block3_3_bn/FusedBatchNormV3/ReadVariableOp/neuron_op_d6f098c01c780733   1

MXNet-Neuron Check Model
========================

The following example shows how to run the MXNet-Neuron Check Model tool with the MXNet ResNet50 tutorial.

1. Start with the MXNet ResNet50 tutorial at :ref:`/src/examples/mxnet/resnet50/resnet50.ipynb` and do the first three steps of the tutorial. Please stay in the Python environment that you set up during the tutorial.

2. Download the tool:

   ::

      wget https://raw.githubusercontent.com/aws/aws-neuron-sdk/master/src/neuron-gatherinfo/mx_neuron_check_model.py
      python mx_neuron_check_model.py -h

   ::

      usage: mx_neuron_check_model.py [-h] [--show_names] [--expand_subgraph] model_path

      positional arguments:
        model_path         path prefix to MXNet model (the part before -symbol.json)

      optional arguments:
        -h, --help         show this help message and exit
        --show_names       list operation by name instead of summarizing by type
                           (caution: this option will generate many lines of output
                           for a large model).
        --expand_subgraph  show subgraph operations.

3. After step 3 of the MXNet ResNet50 tutorial, you can check the uncompiled model to see Neuron supported operations:

   ::

      $ python mx_neuron_check_model.py resnet-50

      * The following table shows the supported and unsupported operations within this uncompiled model.
      * Each line shows an operation type, the number of instances of that type within model,
      * and whether the type is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['null']

      Op Type           Num Instances   Neuron Supported ?
      -------           -------------   ------------------
      BatchNorm         51              Yes
      Convolution       53              Yes
      Activation        50              Yes
      Pooling           2               Yes
      elemwise_add      16              Yes
      Flatten           1               Yes
      FullyConnected    1               Yes
      SoftmaxOutput     1               No

      * Total inference operations: 175
      * Total Neuron supported inference operations: 174
      * Percent of total inference operations supported by Neuron: 99.4

4. You can also check the compiled model to see the number of pipelined NeuronCores for each subgraph:

   ::

      $ python mx_neuron_check_model.py resnet-50_compiled

      * Found 1 Neuron subgraph(s) (_neuron_subgraph_op(s)) in this compiled model.
      * Use this tool on the original uncompiled model to see Neuron supported operations.
      * The following table shows all operations, including Neuron subgraphs.
      * Each line shows an operation type, the number of instances of that type within model,
      * and whether the type is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['null']

      Op Type                Num Instances   Neuron Supported ?
      -------                -------------   ------------------
      _neuron_subgraph_op    1               Yes
      SoftmaxOutput          1               No

      * Please run this model on Inf1 instance with at least 1 NeuronCore(s).
      * The following list show each Neuron subgraph with number of pipelined NeuronCores used by subgraph
      * (and subgraph operations if --expand_subgraph is used):

      Subgraph Name           Num Pipelined NeuronCores
      -------------           -------------------------
      _neuron_subgraph_op0    1

5. When showing subgraph information, you can use ``--expand_subgraph`` to show operation types in each subgraph:

   ::

      $ python mx_neuron_check_model.py resnet-50_compiled --expand_subgraph

      (output truncated to show subgraph information only)

      Subgraph Name           Num Pipelined NeuronCores
      -------------           -------------------------
      _neuron_subgraph_op0    1

        Op Type           Num Instances
        -------           -------------
        BatchNorm         51
        Convolution       53
        Activation        50
        Pooling           2
        elemwise_add      16
        Flatten           1
        FullyConnected    1

6. Use ``--show_names`` to see full operation names (caution: this option will generate many lines of output for a large model):

   ::

      $ python mx_neuron_check_model.py resnet-50_compiled --show_names

      * Found 1 Neuron subgraph(s) (_neuron_subgraph_op(s)) in this compiled model.
      * Use this tool on the original uncompiled model to see Neuron supported operations.
      * The following table shows all operations, including Neuron subgraphs.
      * Each line shows an operation name and whether the type of that operation is supported in Neuron.
      * Some operation types are excluded from table because they are no-operations or training-related operations:
        ['null']

      Op Name                 Op Type                Neuron Supported ?
      -------                 -------                ------------------
      _neuron_subgraph_op0    _neuron_subgraph_op    Yes
      softmax                 SoftmaxOutput          No

      * Please run this model on Inf1 instance with at least 1 NeuronCore(s).
      * The following list show each Neuron subgraph with number of pipelined NeuronCores used by subgraph
      * (and subgraph operations if --expand_subgraph is used):

      Subgraph Name           Num Pipelined NeuronCores
      -------------           -------------------------
      _neuron_subgraph_op0    1

================================================
FILE: archive/helper-tools/tutorial-neuron-gatherinfo.rst
================================================

.. _neuron_gatherinfo:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
================================================
FILE: archive/helper-tools/tutorial-neuron-gatherinfo.rst
================================================
.. _neuron_gatherinfo:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

Using Neuron GatherInfo Tool to collect debug and support information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Overview
========

The Neuron GatherInfo tool ``neuron-gatherinfo.py`` helps automate the collection and packaging of information from Neuron SDK tools that is useful to both you and AWS for issue resolution. The tool gathers log files and other system information. When the information is being supplied to AWS, the tool redacts proprietary and confidential information.

The GatherInfo tool is supplied in source code form, available here: :github:`Neuron Gatherinfo `

The tool enables developers to gather compiler and inference/runtime logs. The most common usage is from within one of the supported ML frameworks that have been integrated with Neuron, and information can be captured from those compile/runtime environments using the frameworks.

Steps Overview:
~~~~~~~~~~~~~~~

1. Obtain a copy of neuron-gatherinfo.py from :github:`Neuron Gatherinfo `
2. Install it into a location in your $PATH, or into a location from which you can launch the script
3. Use it with compile and/or runtime environments

Neuron-CC information gathering
-------------------------------

Step 1: Re-run the compile steps for your workload with increased verbosity or debug levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- For TensorFlow-Neuron, change the Python code as shown. Note that ``compiler-workdir`` is expected to be an empty directory to prevent files from other runs from interfering with the information gathering. The call to the compile function has to be augmented with the **verbose** and the **compiler_workdir** arguments. In addition, please capture the stdout messages into a file (for example, by redirecting stdout to a file):

  ::

     tfn.saved_model.compile(model_dir, compiled_model_dir,
                             compiler_args=['--verbose', '2', '--pipeline', 'compile', 'SaveTemps'],
                             compiler_workdir='./compiler-workdir')

- For Neuron Apache MXNet, add compiler arguments as shown below and run the compilation process from an empty workdir:

  ::

     import mxnet as mx
     import os
     from packaging import version

     mxnet_version = version.parse(mx.__version__)
     if mxnet_version >= version.parse("1.8"):
         import mx_neuron as neuron
     else:
         from mxnet.contrib import neuron

     ...

     os.environ['SUBGRAPH_INFO'] = '1'
     compile_args = {'--verbose': 2, '--pipeline': 'compile', 'flags': ['SaveTemps']}
     csym, cargs, cauxs = neuron.compile(sym, args, auxs, inputs=inputs, **compile_args)

.. _step-2-run-neuron-gatherinfopy-to-gather-information-to-share:

Step 2: Run neuron-gatherinfo.py to gather information to share
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The output result will be a tar.gz file.

Neuron Runtime information gathering
------------------------------------

Step 1: Execute inference steps for your workload with increased verbosity or debug levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the case of runtime information, the tool **neuron-dump.py** is used by **neuron-gatherinfo.py** to gather that information. Make sure that you have the Neuron tools package (aws-neuron-tools) installed.

.. _step-2-run-neuron-gatherinfopy-to-gather-information-to-share-1:

Step 2: Run neuron-gatherinfo.py to gather information to share
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The output result will be a tar.gz file.
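Since the archive is meant to be shared, you may want to audit exactly what was collected before sending it. The following is a minimal sketch (not part of the GatherInfo tool), assuming the archive produced in Step 2 is named ``neuron-gatherinfo.tar.gz`` in the current directory:

.. code:: python

   import tarfile

   # List every member of the gathered archive so you can verify that
   # nothing you consider sensitive is included before sharing it.
   with tarfile.open('neuron-gatherinfo.tar.gz', 'r:gz') as archive:
       for member in archive.getmembers():
           print(f'{member.size:>10d}  {member.name}')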
Tool Usage Reference
====================

Run neuron-gatherinfo.py using the ``--help`` option:

::

   bash $ ~/bin/neuron-gatherinfo.py --help
   usage: neuron-gatherinfo.py [-h] [--additionalfileordir ADDFLDIR] [-c CCDIR]
                               [-i] [-f FILTERFILE] [-m] -o OUTDIR [-r RTDIR]
                               -s STDOUT [-v]

   Usage: /home/user/bin/neuron-gatherinfo.py [options] This program is used to
   gather information from this system for analysis and debugging

   optional arguments:
     -h, --help            show this help message and exit
     --additionalfileordir ADDFLDIR
                           Additional file or directory that the user wants to
                           provide in the archive. The user can sanitize this
                           file or directory before sharing
     -c CCDIR, --compileroutdir CCDIR
                           Location of the neuron-cc generated files
     -i, --include         By default, only the lines containing (grep) patterns
                           like 'nrtd|neuron|kernel:' from the syslog are copied.
                           Other lines are excluded. Using this option allows the
                           timestamp section of other lines to be included. The
                           rest of the contents of the line itself are elided.
                           Providing the timestamp section may provide time
                           continuity while viewing the copied syslog file
     -f FILTERFILE, --filter FILTERFILE
     -m, --modeldata       By using this option, the entire compiler work
                           directory's contents will be included (excluding the
                           .pb files, unless an additional option is used). This
                           would include model information, etc. The files that
                           are included, by default, are these:
                           graph_def.neuron-cc.log, all_metrics.csv,
                           hh-tr-operand-tensortensor.json
     -o OUTDIR, --out OUTDIR
                           The output directory where all the files and other
                           information will be stored. The output will be stored
                           as an archive as well as the actual directory where
                           all the contents are copied. This will allow a simple
                           audit of the files, if necessary. *** N O T E ***:
                           Make sure that this directory has enough space to
                           hold the files and resulting archive
     -r RTDIR, --runtimeoutdir RTDIR
                           Location of the neuron runtime generated files
     -s STDOUT, --stdout STDOUT
                           The file where the stdout of the compiler run was
                           saved
     -v, --verbose         Verbose mode displays commands executed and any
                           additional information which may be useful in
                           debugging the tool itself

Examples
========

Example 1: no ML model information gathered (default behavior)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, the tool will archive just the default information gathering:

::

   bash $ sudo ~/bin/neuron-gatherinfo.py -o compile-and-run-info-for-debugging-no-model-info -i --verbose -s stdout-from-compile_resnet50.out -c compiler-workdir

   Running cmd: lscpu and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo/report-lscpu.txt
   Running cmd: lshw and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo/report-lshw.txt
   Running cmd: lspci | grep -i Amazon and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo/report-lspci.txt
   Running cmd: neuron-cc --version and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo/report-neuron-cc.txt
   Running cmd: neuron-ls and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo/report-neuron-ls.txt

   ******
   Archive created at:
   /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo.tar.gz
   From directory:
   /home/user/tutorials-3/compile-and-run-info-for-debugging-no-model-info/neuron-gatherinfo
   ******

.. _example-2--model-ml-information-gathered-using-the-modeldata-option:

Example 2: model ML information gathered using the ``--modeldata`` option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, the tool will archive the compiler work directory in addition to the default information gathering:

::

   bash $ sudo ~/bin/neuron-gatherinfo.py -o compile-and-run-info-for-debugging -i --verbose -s stdout-from-compile_resnet50.out -c compiler-workdir --modeldata

   Running cmd: lscpu and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo/report-lscpu.txt
   Running cmd: lshw and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo/report-lshw.txt
   Running cmd: lspci | grep -i Amazon and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo/report-lspci.txt
   Running cmd: neuron-cc --version and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo/report-neuron-cc.txt
   Running cmd: neuron-ls and capturing output in file: /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo/report-neuron-ls.txt

   ******
   Archive created at:
   /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo.tar.gz
   From directory:
   /home/user/tutorials-3/compile-and-run-info-for-debugging/neuron-gatherinfo
   ******

   **************************
   Based on your command line option, we're also packaging these files:
   graph_def.neuron-cc.log
   all_metrics.csv
   hh-tr-operand-tensortensor.json
   And this directory:
   /home/user/tutorials-3/compiler-workdir
   **************************

================================================
FILE: archive/index.rst
================================================
.. meta::
   :description: Archived AWS Neuron SDK documentation
   :keywords: AWS Neuron SDK, archived tutorials, legacy documentation
   :date-modified: 12-02-2025

=====================================
Archived AWS Neuron SDK documentation
=====================================

.. note::
   This page contains archived tutorials and other documentation for older versions of the AWS Neuron SDK. These pages are no longer actively maintained and may reference unsupported features or deprecated APIs. They are provided as-is and may not reflect the current state of the AWS Neuron SDK.

Overview
--------

The following content has been archived for reference purposes. For the latest documentation and guides, visit the `AWS Neuron SDK documentation `_.

Archived feature docs
---------------------

.. list-table::
   :header-rows: 1

   * - Feature
     - Last release supported
     - Date archived
   * - :doc:`tensorboard/getting-started-tensorboard-neuron-plugin`
     - Neuron 2.27.0
     - Archived on: 12/2/2025
   * - :doc:`neuronperf/index`
     - Neuron 2.27.0
     - Archived on: 12/2/2025
   * - :doc:`helper-tools/index`
     - Neuron 2.27.0
     - Archived on: 12/2/2025
   * - :doc:`transformers-neuronx/index`
     - Neuron 2.25.0
     - Archived on: 9/15/2025
   * - :doc:`MXNet Neuron Setup Guides `
     - Neuron 2.27.0
     - Archived on: 3/30/2026
   * - :doc:`mxnet-neuron/index`
     - Neuron 2.16.0
     - Archived on: 3/11/2026
   * - :doc:`tensorflow/index`
     - Neuron 2.22.0
     - Archived on: 3/11/2026
   * - :doc:`torch-neuron/index`
     - Neuron 2.22.0
     - Archived on: 3/11/2026

Archived tutorials
------------------
.. list-table::
   :header-rows: 1

   * - Tutorial
     - Last release supported
     - Date archived
   * - :doc:`tutorials/finetune_t5`
     - Neuron 2.24.0
     - Archived on: 7/31/2025
   * - :doc:`tutorials/ssd300_demo/ssd300_demo`
     - Neuron 2.24.0
     - Archived on: 7/31/2025
   * - :doc:`tutorials/megatron_gpt_pretraining`
     - Neuron 2.25.0
     - Archived on: 7/31/2025
   * - :doc:`tutorials/finetuning_llama2_7b_ptl`
     - Neuron 2.26.0
     - Archived on: 8/25/2025
   * - :doc:`tutorials/training_llama2_tp_pp_ptl`
     - Neuron 2.26.0
     - Archived on: 8/25/2025
   * - :doc:`tutorials/training_codegen25_7b`
     - Neuron 2.26.0
     - Archived on: 8/25/2025
   * - :doc:`tutorials/gpt3_neuronx_nemo_megatron_pretraining`
     - Neuron 2.26.0
     - Archived on: 8/25/2025
   * - :doc:`tutorials/multinode-training-model-profiling`
     - Neuron 2.29.0
     - Archived on: 3/30/2026

.. toctree::
   :maxdepth: 1
   :hidden:

   tutorials/finetune_t5
   tutorials/ssd300_demo/ssd300_demo
   tutorials/megatron_gpt_pretraining
   tutorials/training-gpt-neox-20b
   tutorials/finetuning_llama2_7b_ptl
   tutorials/training_llama2_tp_pp_ptl
   tutorials/training_codegen25_7b
   tutorials/multinode-training-model-profiling
   tutorials/training-gpt-neox
   tensorboard/getting-started-tensorboard-neuron-plugin
   neuronperf/index
   helper-tools/index
   transformers-neuronx/index
   mxnet-neuron/index
   tensorflow/index
   torch-neuron/index

Accessing Archived Content
--------------------------

Each tutorial listed above corresponds to a specific version or feature set of the Neuron SDK that has since been superseded. Use these resources for historical context or migration guidance.

.. warning::
   Archived tutorials may not be compatible with current Neuron SDK releases. Exercise caution when following instructions from these documents.

================================================
FILE: archive/mxnet-neuron/api-compilation-python-api.rst
================================================
.. _ref-mxnet-neuron-compilation-python-api:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Neuron Apache MXNet Compilation Python API
=======================================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The MXNet-Neuron compilation Python API provides a method to compile a model graph for execution on Inferentia.

Description
-----------

Within the graph or subgraph, the compile method selects and sends Neuron-supported operations to the Neuron compiler for compilation and saves the compiled artifacts in the graph. Uncompilable operations are kept as original operations for framework execution.

The compiled graph can be saved using the MXNet save_checkpoint and served using MXNet Model Serving. Please see :ref:`mxnet-neuron-model-serving` for more information about exporting to saved model and serving using MXNet Model Serving.

Options can be passed to the Neuron compiler via the compile function. For example, the “\ ``--neuroncore-pipeline-cores``\ ” option directs the Neuron compiler to compile each subgraph to fit in the specified number of NeuronCores. This number can be less than the total available NeuronCores on an Inf1 instance. See :ref:`neuron-compiler-cli-reference` for more information about compiler options.

For debugging compilation, set the SUBGRAPH_INFO=1 environment variable before calling the compilation script. The extracted subgraphs are preserved as hidden files in the run directory.
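As a concrete illustration, the debug setting is just an environment variable that must be in place before ``neuron.compile`` runs. A minimal sketch, following the same pattern used in the GatherInfo tutorial above (the ``sym``, ``args``, ``aux``, and ``img`` handles are assumed to come from your own model-loading code):

.. code:: python

   import os
   import mx_neuron as neuron  # MXNet 1.8; use mxnet.contrib.neuron on 1.5

   # Enable preservation of the extracted subgraphs as hidden files in
   # the run directory, for compilation debugging.
   os.environ['SUBGRAPH_INFO'] = '1'

   # Compile as usual; sym/args/aux/img come from your model-loading code.
   csym, cargs, cauxs = neuron.compile(sym, args, aux, inputs={'data': img})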
For more information, see :ref:`neuron_gatherinfo`.

**MXNet 1.5**
-------------

Method
------

.. code:: python

   from mxnet.contrib import neuron

   neuron.compile(sym, args, aux, inputs, **compile_args)

Arguments
---------

- **sym** - Symbol object loaded from symbol.json file
- **args** - args/params dictionary loaded from params file
- **aux** - aux/params dictionary loaded from params file
- **inputs** - a dictionary with key/value mappings for input name to input numpy arrays
- **kwargs** (optional) - a dictionary with key/value mappings for MXNet-Neuron compilation and Neuron Compiler options.

  - For example, to limit the number of NeuronCores per subgraph, use ``compile_args={'--neuroncore-pipeline-cores' : N}``, where N is an integer representing the maximum number of NeuronCores per subgraph.
  - Additional compiler flags can be passed using ``'flags' : []``, where the value is a comma-separated list of strings. See :ref:`neuron_gatherinfo` for an example of passing debug flags to the compiler.
  - Advanced option to exclude node names: ``compile_args={'excl_node_names' : []}``, where the value is a comma-separated list of node name strings.

Returns
-------

- **sym** - new partitioned symbol
- **args** - modified args/params
- **auxs** - modified aux/params

Example Usage: Compilation
--------------------------

The following is an example usage of the compilation, with default compilation arguments:

.. code:: python

   from mxnet.contrib import neuron
   ...
   neuron.compile(sym, args, aux, inputs={'data' : img})

**MXNet 1.8**
-------------

Method
------

.. code:: python

   import mx_neuron as neuron

   neuron.compile(obj, args=None, aux=None, inputs=None, **compile_args)

Arguments
---------

- **obj** - Symbol object loaded from symbol.json file, or gluon.HybridBlock object
- **args** (optional) - args/params dictionary loaded from params file. Only needed in the case of a Symbol object
- **aux** (optional) - aux/params dictionary loaded from params file. Only needed in the case of a Symbol object
- **inputs** - a dictionary with key/value mappings for input name to input numpy arrays.
- **kwargs** (optional) - a dictionary with key/value mappings for MXNet-Neuron compilation and Neuron Compiler options.

  - For example, to limit the number of NeuronCores per subgraph, use ``compile_args={'--neuroncore-pipeline-cores' : N}``, where N is an integer representing the maximum number of NeuronCores per subgraph.
  - Additional compiler flags can be passed using ``'flags' : []``, where the value is a comma-separated list of strings. See :ref:`neuron_gatherinfo` for an example of passing debug flags to the compiler.
  - Advanced option to exclude node names: ``compile_args={'excl_node_names' : []}``, where the value is a comma-separated list of node name strings.
  - **work_dir** - relative or absolute path for storing compiler artifacts (including params and jsons) generated during compilation when SUBGRAPH_INFO=1.

Returns
-------

- **(sym, args, auxs)** - for a Symbol object as input: sym, args and auxs are the new partitioned symbol, modified args/params and modified aux/params, respectively.
- **(obj)** - for a gluon.HybridBlock object as input: obj is the partitioned and optimized gluon.HybridBlock object for the Neuron backend.
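The options described in the Arguments list can be combined in a single ``compile_args`` dictionary, mirroring the pattern used in the GatherInfo tutorial above. A minimal sketch for MXNet 1.8, where the specific values (two NeuronCores per subgraph, the ``SaveTemps`` flag, and the ``fc1_output`` node name) are illustrative only:

.. code:: python

   import mx_neuron as neuron

   # Illustrative values: limit each subgraph to 2 pipelined NeuronCores,
   # raise compiler verbosity, pass an extra bare flag, and exclude one
   # (hypothetical) node by name.
   compile_args = {
       '--neuroncore-pipeline-cores': 2,
       '--verbose': 2,
       'flags': ['SaveTemps'],
       'excl_node_names': ['fc1_output'],
   }

   # sym/args/aux/img come from your model-loading code.
   csym, cargs, cauxs = neuron.compile(sym, args, aux,
                                       inputs={'data': img}, **compile_args)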
Example Usage: Compilation
--------------------------

The following is an example usage of the compilation, with default compilation arguments for a Symbol object:

.. code:: python

   import mx_neuron as neuron
   ...
   neuron.compile(sym, args, aux, inputs={'data' : img})

The following is an example usage of the compilation, with default compilation arguments for a gluon.HybridBlock object (only supported in MXNet-Neuron 1.8):

.. code:: python

   import mx_neuron as neuron
   ...
   neuron.compile(obj, inputs={'data' : img})

Example Usage: Extract Compilation Statistics
---------------------------------------------

To extract operation counts, insert the following code after the compile step (assume csym is the compiled MXNet symbol):

.. code:: python

   import json

   # Return list of nodes from MXNet symbol
   def sym_nodes(sym):
       return json.loads(sym.tojson())['nodes']

   # Return number of operations in node list
   def count_ops(graph_nodes):
       return len([x['op'] for x in graph_nodes if x['op'] != 'null'])

   # Return triplet of compile statistics:
   # - count of operations in symbol database
   # - number of Neuron subgraphs
   # - number of operations compiled to Neuron runtime
   def get_compile_stats(sym):
       cnt = count_ops(sym_nodes(sym))
       neuron_subgraph_cnt = 0
       neuron_compiled_cnt = 0
       for g in sym_nodes(sym):
           if g['op'] == '_neuron_subgraph_op':
               neuron_subgraph_cnt += 1
               for sg in g['subgraphs']:
                   neuron_compiled_cnt += count_ops(sg['nodes'])
       return (cnt, neuron_subgraph_cnt, neuron_compiled_cnt)

   original_cnt = count_ops(sym_nodes(sym))
   post_compile_cnt, neuron_subgraph_cnt, neuron_compiled_cnt = get_compile_stats(csym)

   print("INFO:mxnet: Number of operations in original model: ", original_cnt)
   print("INFO:mxnet: Number of operations in compiled model: ", post_compile_cnt)
   print("INFO:mxnet: Number of Neuron subgraphs in compiled model: ", neuron_subgraph_cnt)
   print("INFO:mxnet: Number of operations placed on Neuron runtime: ", neuron_compiled_cnt)

.. code:: bash

   INFO:mxnet: Number of operations in original model:  67
   INFO:mxnet: Number of operations in compiled model:  4
   INFO:mxnet: Number of Neuron subgraphs in compiled model:  2
   INFO:mxnet: Number of operations placed on Neuron runtime:  65

================================================
FILE: archive/mxnet-neuron/api-reference-guide.rst
================================================
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

API Reference Guide (mxnet-neuron)
==================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1
   :hidden:

   /archive/mxnet-neuron/api-compilation-python-api

.. include:: /archive/mxnet-neuron/api-reference-guide.txt

================================================
FILE: archive/mxnet-neuron/api-reference-guide.txt
================================================
* :ref:`ref-mxnet-neuron-compilation-python-api`

================================================
FILE: archive/mxnet-neuron/developer-guide.rst
================================================
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Developer Guide
===============

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1
   :hidden:

   /about-neuron/appnotes/mxnet-neuron/flex-eg

..
include:: /archive/mxnet-neuron/developer-guide.txt ================================================ FILE: archive/mxnet-neuron/developer-guide.txt ================================================ * :ref:`flexeg` ================================================ FILE: archive/mxnet-neuron/ec2-then-ec2-devflow.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /devflows/inference/ec2-then-ec2-devflow.rst ================================================ FILE: archive/mxnet-neuron/index.rst ================================================ Neuron Apache MXNet Release Notes ============================================== .. toctree:: :maxdepth: 1 /release-notes/archive/mxnet-neuron ================================================ FILE: archive/mxnet-neuron/inference-mxnet-neuron.rst ================================================ .. _inference-mxnet-neuron: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Inference (mxnet-neuron) (maintenance) ======================================= .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: Tutorials API Reference Guide Developer Guide Misc .. include:: inference-mxnet-neuron.txt ================================================ FILE: archive/mxnet-neuron/inference-mxnet-neuron.txt ================================================ .. card:: Setup (``mxnet-neuron``) :link: setup-mxnet-neuron :link-type: ref :class-body: sphinx-design-class-title-small .. dropdown:: Tutorials :class-title: sphinx-design-class-title-small :animate: fade-in .. include:: /archive/mxnet-neuron/tutorials/tutorials-mxnet-neuron.txt .. dropdown:: API Reference Guide :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/mxnet-neuron/api-reference-guide.txt .. dropdown:: Developer Guide :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/mxnet-neuron/developer-guide.txt .. dropdown:: Misc :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/mxnet-neuron/misc-mxnet-neuron.txt ================================================ FILE: archive/mxnet-neuron/misc-mxnet-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Misc (mxnet-neuron) =================== .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: /archive/mxnet-neuron/troubleshooting-guide What's New /release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-mxnet .. 
include:: /archive/mxnet-neuron/misc-mxnet-neuron.txt

================================================
FILE: archive/mxnet-neuron/misc-mxnet-neuron.txt
================================================
* :ref:`mxnet_troubleshooting_guide`
* :ref:`What's New `
* :ref:`neuron-cc-ops-mxnet`

================================================
FILE: archive/mxnet-neuron/mxnet-neuron-setup.rst
================================================
.. _mxnet-setup:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

MXNet Neuron Setup
==================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: mxnet-neuron-setup.txt

================================================
FILE: archive/mxnet-neuron/mxnet-neuron-setup.txt
================================================
.. card:: MXNet Neuron (``mxnet-neuron``) Setup for Inf1 Instances
   :link: setup-mxnet-neuron
   :link-type: ref
   :class-body: sphinx-design-class-title-small

================================================
FILE: archive/mxnet-neuron/neo-then-hosting-devflow.rst
================================================
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/neo-then-hosting-devflow.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-install-prev-al2.rst
================================================
.. _mxnet-neuron-install-prev-al2:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install Previous MXNet Neuron Releases for Amazon Linux (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1

This section will assist you in installing previous Neuron releases.

.. tab-set::

   .. tab-item:: Neuron 2.18.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.17.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.17.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.16.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.16.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

================================================
FILE: archive/mxnet-neuron/setup/mxnet-install-prev-al2023.rst
================================================
..
_mxnet-neuron-install-prev-al2023: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous MXNet Neuron Releases for Amazon Linux 2023 (``mxnet-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/mxnet-neuron/setup/mxnet-install-prev-u20.rst ================================================ .. Install previous MXNet Neuron releases for Ubuntu 20.04 - archived Use the tabs below to install a specific previous Neuron SDK release. Select the Neuron version you need. .. tab-set:: .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/mxnet-neuron/setup/mxnet-install-prev-u22.rst ================================================ .. _mxnet-neuron-install-prev-u22: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous MXNet Neuron Releases for Ubuntu 22 (``mxnet-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. 
tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/mxnet-neuron/setup/mxnet-install.rst ================================================ .. _install-neuron-mxnet: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install MXNet Neuron ===================== .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /setup/install-templates/inf1/note-setup-cntr.rst .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/mxnet-neuron/setup/mxnet-neuron-al2-base-dlami.rst ================================================ .. _setup-mxnet-neuron-al2-base-dlami: .. 
card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Amazon Linux 2
=========================================================

.. contents:: Table of contents
   :local:
   :depth: 2

.. include:: /setup/install-templates/al2-python.rst

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Amazon Linux 2) " from the "AMI Name:" section
   * Search for the copied AMI name in the AMI Search; you should see a matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance.
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance

.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-al2.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-update-u20.rst

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-al2.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-neuron-al2.rst
================================================
.. _setup-mxnet-neuron-al2:

.. include:: /setup/install-templates/al2-python.rst

.. card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Amazon Linux 2
======================================================

.. contents:: Table of contents
   :local:
   :depth: 2

.. include:: /setup/install-templates/al2-python.rst

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Select Amazon Linux 2 AMI(HVM) - Kernel 5.10
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance
.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-al2.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-update-u20.rst

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-al2.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-neuron-al2023.rst
================================================
.. _setup-mxnet-neuron-al2023:

.. card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Amazon Linux 2023
=========================================================

.. contents:: Table of contents
   :local:
   :depth: 2

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Select Amazon Linux 2023 AMI
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance

.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-al2023.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-al2023.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-neuron-ubuntu20-base-dlami.rst
================================================
.. _setup-mxnet-neuron-u20-base-dlami:

.. card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Ubuntu 20
================================================

.. contents:: Table of contents
   :local:
   :depth: 2

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console,
     please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Check for the latest version of the `DLAMI Base AMI `_ and copy the AMI name that starts with "Deep Learning Base Neuron AMI (Ubuntu 20.04) " from the "AMI Name:" section
   * Search for the copied AMI name in the AMI Search; you should see a matching AMI with that name in Community AMIs. Select the AMI and use it to launch the instance.
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance

.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-u20.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-update-u20.rst

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-u20.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-neuron-ubuntu20.rst
================================================
.. _setup-mxnet-neuron-u20:

.. card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Ubuntu 20
=================================================

.. contents:: Table of contents
   :local:
   :depth: 2

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Select Ubuntu Server 20 AMI
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance

.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-u20.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-update-u20.rst

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-u20.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-neuron-ubuntu22.rst
================================================
.. _setup-mxnet-neuron-u22:

.. card:: Select a Different Framework or Platform for Setup
   :link: setup-guide-index
   :link-type: ref
   :class-body: sphinx-design-class-title-small

MXNet Neuron ("mxnet-neuron") Setup on Ubuntu 22
=================================================
.. contents:: Table of contents
   :local:
   :depth: 2

Get Started with Latest Release of MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides links to help you quickly start with a fresh installation of :ref:`install-neuron-mxnet`.

.. dropdown:: Launch the Instance
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   * Please follow the instructions at `launch an Amazon EC2 Instance `_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
   * To get more information about instance sizes and pricing see: `Inf1 web page `_
   * Select Ubuntu Server 22 AMI
   * After launching the instance, follow the instructions in `Connect to your instance `_ to connect to the instance

.. dropdown:: Install Drivers and Tools
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami --category=driver_runtime_tools

.. include:: /includes/setup/tab-inference-mxnet-neuron-u22.txt

.. include:: /archive/mxnet-neuron/setup/mxnet-install-prev-u22.rst

================================================
FILE: archive/mxnet-neuron/setup/mxnet-update-u20.rst
================================================
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. _mxnet-neuron-u20-update:

Update to latest MXNet Neuron (``mxnet-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

If you already have a previous Neuron release installed, this section provides links to help you update to the latest Neuron release.

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

   .. tab-item:: MXNet 1.5.1

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

================================================
FILE: archive/mxnet-neuron/setup/mxnet-update.rst
================================================
.. _update-neuron-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Update to latest MXNet Neuron
===============================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

..
contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. 
include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=mxnet --framework-version=1.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.14.2-mxnet-install.rst ================================================ .. _install-neuron-1.14.2-mxnet: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install MXNet Neuron (Neuron 1.14.2) ====================================== .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=mxnet-1.5.1

================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.15.0-mxnet-install.rst
================================================

.. _install-neuron-1.15.0-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron (Neuron 1.15.0)
======================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux AMI
            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=mxnet-1.5.1
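The archived pages that follow are identical except for the ``--neuron-version`` passed to the setup helper. To see exactly which rendered instructions changed between two adjacent releases, the helper's output can be diffed directly. A convenience sketch, assuming a shell with process substitution (such as bash):

.. code-block:: bash

   # Compare the rendered develop-mode instructions of two archived releases.
   render() {
       python3 src/helperscripts/neuronsetuphelper.py \
           --file src/helperscripts/neuron-releases-manifest.json \
           --install mxnet --mode=develop --ami=non-dlami --os=ubuntu \
           --neuron-version="$1"
   }
   diff <(render 1.15.0) <(render 1.15.1)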
================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.15.1-mxnet-install.rst
================================================

.. _install-neuron-1.15.1-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron (Neuron 1.15.1)
======================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=mxnet-1.5.1

================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.15.2-mxnet-install.rst
================================================

.. _install-neuron-1.15.2-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron (Neuron 1.15.2)
======================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=mxnet-1.5.1
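Every page in this directory is rendered from the same release manifest. To check which MXNet entries a given manifest actually defines, it can be inspected directly; this assumes only that the file is plain JSON, as its extension suggests:

.. code-block:: bash

   # Pretty-print the manifest and list the lines that mention mxnet.
   python3 -m json.tool src/helperscripts/neuron-releases-manifest.json | grep -n mxnet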
================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.16.3-mxnet-install.rst
================================================

.. _install-neuron-1.16.3-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron
=====================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=mxnet-1.5.1

================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.17.2-mxnet-install.rst
================================================

.. _install-neuron-1.17.2-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron
=====================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst
            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=mxnet-1.5.1
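Each archived page repeats the same three sections: develop, compile, and deploy. To regenerate all three instruction sets for a single release in one pass, a small loop over ``--mode`` is enough. A convenience sketch, not part of the documentation build:

.. code-block:: bash

   # Render develop, compile, and deploy instructions for one release.
   for mode in develop compile deploy; do
       echo "== ${mode} =="
       python3 src/helperscripts/neuronsetuphelper.py \
           --file src/helperscripts/neuron-releases-manifest.json \
           --install mxnet --mode="${mode}" --ami=non-dlami --os=ubuntu \
           --neuron-version=1.18.0
   done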
================================================
FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.18.0-mxnet-install.rst
================================================

.. _install-neuron-1.18.0-mxnet:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install MXNet Neuron
=====================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK.
   It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: MXNet 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: MXNet 1.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=mxnet-1.5.1 ================================================ FILE: archive/mxnet-neuron/setup/prev-releases/neuron-1.19.0-mxnet-install.rst ================================================ .. _install-neuron-1.19.0-mxnet: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install MXNet Neuron ===================== .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /setup/install-templates/inf1/note-setup-cntr.rst .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: MXNet 1.8.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 .. tab-item:: MXNet 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1

.. tab-item:: Ubuntu DLAMI

   .. include:: /setup/install-templates/inf1/note-setup-general.rst

   .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=mxnet-1.5.1

.. tab-item:: Amazon Linux DLAMI

   .. include:: /setup/install-templates/inf1/note-setup-general.rst

   .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install mxnet --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=mxnet-1.5.1

================================================
FILE: archive/mxnet-neuron/setup/setup-inference
================================================

Setup Guide for Inf1
====================

.. toctree::
   :maxdepth: 1

   Fresh install
   Update to latest release
   Install previous releases

================================================
FILE: archive/mxnet-neuron/troubleshooting-guide.rst
================================================

.. _mxnet_troubleshooting_guide:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Troubleshooting Guide for Neuron Apache MXNet
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of Contents
   :local:
   :depth: 2

Inference Runtime Error
=======================

Out-of-memory error when calling Symbol API bind() too many times
-----------------------------------------------------------------

.. important::

   ``NEURONCORE_GROUP_SIZES`` is no longer supported starting with the Neuron 1.19.0 release. If your application uses ``NEURONCORE_GROUP_SIZES``, see :ref:`neuron-migrating-apps-neuron-to-libnrt` and :ref:`eol-ncgs-env_2` for more details.

If you see an out-of-memory error when using the Symbol API's bind() function, ensure that bind() is called once for each desired model instance. For example, on inf1.xlarge, use the Symbol API to create 4 parallel instances of a model that was compiled to 1 NeuronCore (--neuroncore-pipeline-cores=1), each bound to a different mx.neuron(i) context, where i is the NeuronCore Group index ranging from 0 to 3. Then use 4 threads to feed the 4 instances in parallel. For example:
.. code:: python

   import os
   from concurrent import futures

   import mxnet as mx

   NUM_PARALLEL = 4
   os.environ['NEURONCORE_GROUP_SIZES'] = ','.join('1' for _ in range(NUM_PARALLEL))

   # recfile_base is the path to your RecordIO input file (defined elsewhere)
   data_iter = []
   for i in range(NUM_PARALLEL):
       data_iter.append(mx.io.ImageRecordIter(
           path_imgrec=recfile_base,
           data_shape=(3, 224, 224),
           batch_size=1,
           prefetch_buffer=1,
           num_parts=NUM_PARALLEL,
           part_index=i))

   sym, args, auxs = mx.model.load_checkpoint('resnet-50_compiled', 0)
   exec_list = []
   for i in range(NUM_PARALLEL):
       exe = sym.bind(ctx=mx.neuron(i), args=args, aux_states=auxs, grad_req='null')
       exec_list.append(exe)

   def single_thread_infer(i):
       for batch in data_iter[i]:
           img = batch.data[0]
           label = batch.label
           feed_dict = {'data': img}
           exe = exec_list[i]
           exe.copy_params_from(feed_dict)
           exe.forward()
           out = exe.outputs[0]

   future_list = []
   with futures.ThreadPoolExecutor(max_workers=NUM_PARALLEL) as executor:
       for i in range(NUM_PARALLEL):
           future_list.append(executor.submit(single_thread_infer, i))

Inference crashed with MXNetError: InferShapeKeyword argument name xyz not found
--------------------------------------------------------------------------------

If you see an MXNetError such as:

.. code:: bash

   mxnet.base.MXNetError: [11:55:39] src/c_api/c_api_symbolic.cc:508: InferShapeKeyword argument name xyz not found

it is followed by a list of "Candidate arguments". This list shows all the input argument names that the model knows about, and 'xyz' is not in the list. To fix this, remove the entry xyz from the feed dictionary.

Inference crashed at mx.nd.waitall() with MXNetError: Check failed: bin.dtype() == mshadow::kUint8
--------------------------------------------------------------------------------------------------

An MXNetError exception with 'Check failed: bin.dtype() == mshadow::kUint8' can occur when executing the Symbol API's forward function followed by mx.nd.waitall().

Inference crashed with NRTD error 1002
--------------------------------------

During inference, the user may encounter an error with details "[NRTD:infer_wait] error: 1002":

.. code:: bash

   mxnet.base.MXNetError: [11:26:56] src/operator/subgraph/neuron/./neuron_util.h:1175: Check failed: rsp_wait.status().code() == 0 || rsp_wait.status().code() == 1003: Failed Infer Wait with Neuron-RTD Error. Neuron-RTD Status Code: 1002, details: "[NRTD:infer_wait] error: 1002 "

Runtime errors are listed in the Neuron Runtime return codes documentation. In particular, 1002 means that some invalid input has been submitted to infer, e.g. some input tensors are missing or input tensor sizes are incorrect. Please examine /var/log/syslog to see more details on the error. For example, you may see:

.. code::

   Oct 30 19:13:39 ip-172-31-93-131 nrtd[1125]: [TDRV:io_queue_prepare_input_nonhugetlb] Unexpected input size, for data00, expected: 2097152, received: 33554432

This means that the input tensor size is larger than what the model was compiled for (i.e., the example input tensor shapes passed during compilation).
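To catch this class of error before submitting an inference, you can compare each input tensor against the shape the model was compiled with. A minimal sketch, reusing the (1, 3, 224, 224) compile-time shape used elsewhere in this guide; the variable names and the mis-sized input are illustrative:

.. code:: python

   import mxnet as mx

   # Shape the example input had at compilation time (illustrative value)
   compiled_shape = (1, 3, 224, 224)

   img = mx.nd.zeros((1, 3, 512, 512))  # a deliberately mis-sized input
   feed_dict = {'data': img}

   for name, tensor in feed_dict.items():
       if tuple(tensor.shape) != compiled_shape:
           raise ValueError(
               "Input '%s' has shape %s, but the model was compiled for %s"
               % (name, tuple(tensor.shape), compiled_shape))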
Multi-Model Server
==================

Failed to create NEURONCORE Group with GRPC Error. Status Error: 14, Error message: "Connect Failed"
----------------------------------------------------------------------------------------------------

NOTE: This error only applies to MXNet 1.5.

If the client is unable to start workers and you get a message that MMS is unable to create a NeuronCore Group, please check that Neuron RTD is running (neuron-rtd process).

.. code:: json

   {
     "code": 500,
     "type": "InternalServerException",
     "message": "Failed to start workers"
   }

.. code:: bash

   2019-10-23 19:56:23,187 [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [19:56:23] src/operator/subgraph/inferentia/./inferentia_util.h:218: Check failed: status.ok() Failed to create NeuronCore Group with GRPC Error. Status Error: 14, Error message: "Connect Failed"

Multiple MMS workers die with “Backend worker process die.” message
-------------------------------------------------------------------

.. important::

   ``NEURONCORE_GROUP_SIZES`` is no longer supported starting with the Neuron 1.19.0 release. If your application uses ``NEURONCORE_GROUP_SIZES``, see :ref:`neuron-migrating-apps-neuron-to-libnrt` and :ref:`eol-ncgs-env_2` for more details.

If you run inference with MMS and get multiple “Backend worker process die" messages, please ensure that the number of workers ("initial_workers") passed during model load is less than or equal to the number of NeuronCores available divided by the number of NeuronCores required by the model.

.. code:: bash

   com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Backend worker process die.
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1524, in simple_bind
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ctypes.byref(exe_handle)))
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise MXNetError(py_str(_LIB.MXGetLastError()))
   com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mxnet.base.MXNetError: [00:26:32] src/operator/subgraph/neuron/./neuron_util.h:221: Check failed: 0 == create_eg_rsp.status().code() Failed to create NeuronCore Group with KRTD Error. KRTD Status Code: 4, details: ""

As indicated in :ref:`appnote-performance-tuning`, for greater flexibility users can use NEURONCORE_GROUP_SIZES to specify the groupings of NeuronCores into Neuron devices, each device consisting of one or more NeuronCores. Each worker would take a device. The total number of NeuronCores taken by all the workers should be less than or equal to the total number of NeuronCores visible to neuron-rtd. This situation should be considered at full load (MMS scales up to max_workers). Additionally, to properly assign a model to a Neuron device, the environment variable NEURONCORE_GROUP_SIZES must be specified within the model server class (i.e., mxnet_model_service.py in the example above). For example, add the following line within mxnet_model_service.py for a model compiled to 1 NeuronCore:

.. code:: python

   os.environ['NEURONCORE_GROUP_SIZES'] = '1'

More information about the max_worker limit setting can be found in the `MMS Management API Documentation`_. For example, to run up to 4 workers on inf1.xlarge, where 4 NeuronCores are available by default to Neuron-RTD, set max_workers to 4:

.. _MMS Management API Documentation: https://github.com/awslabs/multi-model-server/blob/master/docs/management_api.md#user-content-scale-workers

.. code:: bash

   curl -v -X PUT "http://localhost:8081/models/squeezenet_v1.1_compiled?min_worker=1&max_worker=4"
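The worker ceiling follows directly from the two core counts discussed above and can be sanity-checked up front. A minimal sketch, assuming an inf1.xlarge (4 NeuronCores visible to neuron-rtd) and a model compiled with --neuroncore-pipeline-cores=1; both values are illustrative:

.. code:: python

   # NeuronCores visible to neuron-rtd (4 by default on inf1.xlarge)
   total_neuroncores = 4

   # NeuronCores required per model copy (--neuroncore-pipeline-cores at compile time)
   cores_per_model = 1

   # Upper bound for both initial_workers and max_workers in MMS
   max_workers = total_neuroncores // cores_per_model
   print("Use at most %d workers for this model" % max_workers)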
MMS throws a "mxnet.base.MXNetError: array::at" error
-----------------------------------------------------

If you see “mxnet.base.MXNetError: array::at” when running MMS, please check that the NDArray/Gluon API is not used, as it is not supported in MXNet-Neuron 1.5. If you would like to use the NDArray or Gluon API, please upgrade to MXNet 1.8.

.. code:: bash

   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - array::at
   [INFO ] W-9000-squeezenet_v1.1_compiled com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 30
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/tmp/models/6606fa046f68a34df87f15362a7a2d9a49749878/model_handler.py", line 82, in handle
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     data = self.inference(data)
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/tmp/models/6606fa046f68a34df87f15362a7a2d9a49749878/mxnet_model_service.py", line 153, in inference
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     d.wait_to_read()
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/user/regression_venv_p3.6/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1819, in wait_to_read
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     check_call(_LIB.MXNDArrayWaitToRead(self.handle))
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/user/regression_venv_p3.6/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise MXNetError(py_str(_LIB.MXGetLastError()))
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mxnet.base.MXNetError: array::at
   [INFO ] W-9000-squeezenet_v1.1_compiled-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Invoking custom service failed.

MXNet Model Server is not able to clean up Neuron RTD states after model is unloaded
------------------------------------------------------------------------------------

NOTE: This issue is resolved in version 1.5.1.1.1.88.0, released 11/17/2020, and only applies to MXNet 1.5.

MXNet Model Server is not able to clean up Neuron RTD states after a model is unloaded (deleted) from the model server. Restarting the model server may then fail with a "Failed to create NEURONCORE_GROUP" error:

.. code:: bash

   mxnet.base.MXNetError: [00:26:59] src/operator/subgraph/neuron/./neuron_util.h:348: Check failed: 0 == create_eg_rsp.status().code(): Failed to create NEURONCORE_GROUP with Neuron-RTD Error. Neuron-RTD Status Code: 9, details: ""

The workaround is to run ``/opt/aws/neuron/bin/neuron-cli reset`` to clear Neuron RTD states after all models are unloaded and the server is shut down, before restarting the model server.

Pipeline mode is not able to execute inference requests in parallel
-------------------------------------------------------------------

If you see that multiple executors in a Neuron pipeline setup (one model compiled for more than one NeuronCore using the `--neuroncore-pipeline-cores` option during compilation) are not running in parallel, please set the following MXNet environment variable before inference so that MXNet can execute the CPU ops in parallel; otherwise execution is sequential and stalls the executors. Set ``MXNET_CPU_WORKER_NTHREADS`` to the value of ``__subgraph_opt_neuroncore__`` in the compiled model JSON to ensure that all the executors (threads) can run in parallel.
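That value can also be picked up programmatically. A sketch that scans the compiled symbol JSON for the ``__subgraph_opt_neuroncore__`` attribute and exports the variable before the model is loaded; the filename and the plain text-scan approach are illustrative assumptions, not part of the original guide:

.. code:: python

   import os
   import re

   # Scan the compiled symbol JSON for __subgraph_opt_neuroncore__.
   # A plain text scan avoids assumptions about where the attribute
   # sits in the node hierarchy.
   with open('resnet-50_compiled-symbol.json') as f:
       text = f.read()

   match = re.search(r'"__subgraph_opt_neuroncore__"\s*:\s*"?(\d+)', text)
   if match:
       # Must be set before MXNet starts executing the graph
       os.environ['MXNET_CPU_WORKER_NTHREADS'] = match.group(1)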
Features only in MXNet-Neuron 1.5
---------------------------------

- Shared memory for IFMaps transfer to the Neuron runtime (higher performance compared to GRPC mode)
- Neuron profiling using MXNet

Features only in MXNet-Neuron 1.8
---------------------------------

- Gluon API support
- Library mode Neuron runtime

================================================
FILE: archive/mxnet-neuron/tutorials/mxnet-tutorial-setup.rst
================================================

.. _mxnet-tutorial-setup:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

MXNet Tutorial Setup
====================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

#. Launch an Inf1.6xlarge instance:

   .. include:: /setup/install-templates/inf1/launch-inf1-dlami.rst

#. Set up a development environment:

   * Enable or install MXNet-Neuron: :ref:`install-neuron-mxnet`.

#. Run the tutorial in a Jupyter notebook:

   * Follow the instructions at :ref:`Setup Jupyter notebook ` to:

     #. Start the Jupyter Notebook on the instance
     #. Run the Jupyter Notebook from your local browser

   * Connect to the instance from the terminal, clone the Neuron GitHub repository to the Inf1 instance, and then change the working directory to the tutorial directory:

     .. code::

        git clone https://github.com/aws/aws-neuron-sdk.git
        cd aws-neuron-sdk/src/examples/mxnet

   * Locate the tutorial notebook file (.ipynb file) under ``aws-neuron-sdk/src/examples/mxnet``
   * From your local browser, open the tutorial notebook from the menu and follow the instructions.

================================================
FILE: archive/mxnet-neuron/tutorials/tutorial-model-serving.rst
================================================

.. _mxnet-neuron-model-serving:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Tutorial: Neuron Apache MXNet Model Serving
=============================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

This MXNet Neuron model serving example is adapted from the MXNet vision service example, which uses a pretrained SqueezeNet to perform image classification: https://github.com/awslabs/multi-model-server/tree/master/examples/mxnet_vision. Before starting this example, please ensure that the Neuron-optimized MXNet package mxnet-neuron is installed along with the Neuron compiler.

Warning
*******

If you are using MXNet-1.5, please note that MXNet-1.5 entered maintenance mode and requires Neuron Runtime 1.x; see :ref:`maintenance_mxnet_1_5`. To set up a development environment for MXNet-1.5, see the installation instructions at :ref:`mxnet-setup`.

If using a DLAMI, you can activate the environment aws_neuron_mxnet_p36 and skip the installation part in the first step below.

1. First, install a Java runtime and multi-model-server:

   .. code:: bash

      cd ~/
      # sudo dnf -y install -q jre  # for AL2023
      sudo apt-get install -y -q default-jre  # for Ubuntu
      pip install multi-model-server

   Download the example code:

   .. code:: bash

      git clone https://github.com/awslabs/multi-model-server
      cd ~/multi-model-server/examples/mxnet_vision
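Before moving on, it can be worth confirming that the Neuron-enabled MXNet stack imports cleanly in your environment. A small sketch that reuses the version check from the compile script in the next step:

.. code:: python

   from packaging import version
   import mxnet as mx

   print("MXNet version:", mx.__version__)
   if version.parse(mx.__version__) >= version.parse("1.8"):
       import mx_neuron as neuron  # MXNet 1.8 ships Neuron support as mx_neuron
   else:
       from mxnet.contrib import neuron  # MXNet 1.5 bundles it in contrib
   print("Neuron module loaded successfully")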
2. Compile the ResNet-50 model for the Inferentia target by saving the following Python script to compile_resnet50.py and running “\ ``python compile_resnet50.py``\ ”:

   .. code:: python

      from packaging import version
      import numpy as np
      import mxnet as mx

      mxnet_version = version.parse(mx.__version__)
      if mxnet_version >= version.parse("1.8"):
          import mx_neuron as neuron
      else:
          from mxnet.contrib import neuron

      path = 'http://data.mxnet.io/models/imagenet/'
      mx.test_utils.download(path + 'resnet/50-layers/resnet-50-0000.params')
      mx.test_utils.download(path + 'resnet/50-layers/resnet-50-symbol.json')
      mx.test_utils.download(path + 'synset.txt')

      nn_name = "resnet-50"

      # Load the model
      sym, args, auxs = mx.model.load_checkpoint(nn_name, 0)

      # Define compilation parameters: input shape and dtype
      inputs = {'data': mx.nd.zeros([1, 3, 224, 224], dtype='float32')}

      # Compile the graph to the Inferentia target
      csym, cargs, cauxs = neuron.compile(sym, args, auxs, inputs)

      # Save the compiled model
      mx.model.save_checkpoint(nn_name + "_compiled", 0, csym, cargs, cauxs)

3. Prepare a signature file ``signature.json`` to configure the input name and shape:

   .. code:: json

      {
        "inputs": [
          {
            "data_name": "data",
            "data_shape": [1, 3, 224, 224]
          }
        ]
      }

4. Prepare ``synset.txt``, which is a list of names for the ImageNet prediction classes:

   .. code:: bash

      curl -O https://s3.amazonaws.com/model-server/model_archive_1.0/examples/squeezenet_v1.1/synset.txt

5. Create a custom service class following the template in the model_service_template folder:

   .. code:: bash

      cp -r ../model_service_template/* .

   Edit ``mxnet_model_service.py`` to use the appropriate context. Make the following change:

   .. code:: python

      from packaging import version

      mxnet_version = version.parse(mx.__version__)
      if mxnet_version >= version.parse("1.8"):
          import mx_neuron as neuron
      self.mxnet_ctx = mx.neuron()

   Comment out the existing context setting:

   .. code:: python

      #self.mxnet_ctx = mx.cpu() if gpu_id is None else mx.gpu(gpu_id)

   Also, comment out the unnecessary data copy for model_input in ``mxnet_model_service.py``:

   .. code:: python

      #model_input = [item.as_in_context(self.mxnet_ctx) for item in model_input]

6. Package the model with model-archiver:

   .. code:: bash

      cd ~/multi-model-server/examples
      model-archiver --force --model-name resnet-50_compiled --model-path mxnet_vision --handler mxnet_vision_service:handle

7. Start the MXNet Model Server (MMS) and load the model using the RESTful API. Please ensure that Neuron RTD is running with default settings (see Neuron Runtime Getting Started):

   .. code:: bash

      cd ~/multi-model-server/
      multi-model-server --start --model-store examples  # Pipe to log file if you want to keep a log of MMS
      curl -v -X POST "http://localhost:8081/models?initial_workers=1&max_workers=1&synchronous=true&url=resnet-50_compiled.mar"
      sleep 10  # allow sufficient time to load model

   Each worker requires a NeuronCore group that can accommodate the compiled model. Additional workers can be added by increasing the max_workers configuration as long as there are enough NeuronCores available. Use ``neuron-top`` to see which models are loaded on specific NeuronCores.
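The same management call can also be issued from Python instead of curl, which is convenient when scripting the load step. A sketch using the third-party requests library against the standard MMS management endpoint on port 8081; the parameters mirror the curl command above:

.. code:: python

   import requests

   # Register the compiled model archive with one synchronous worker
   resp = requests.post(
       "http://localhost:8081/models",
       params={
           "url": "resnet-50_compiled.mar",
           "initial_workers": 1,
           "max_workers": 1,
           "synchronous": "true",
       },
   )
   print(resp.status_code, resp.text)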
8. Test inference using an example image:

   .. code:: bash

      curl -O https://raw.githubusercontent.com/awslabs/multi-model-server/master/docs/images/kitten_small.jpg
      curl -X POST http://127.0.0.1:8080/predictions/resnet-50_compiled -T kitten_small.jpg

   You will see the following output:

   .. code:: bash

      [
        {
          "probability": 0.6375716328620911,
          "class": "n02123045 tabby, tabby cat"
        },
        {
          "probability": 0.1692783385515213,
          "class": "n02123159 tiger cat"
        },
        {
          "probability": 0.12187337130308151,
          "class": "n02124075 Egyptian cat"
        },
        {
          "probability": 0.028840631246566772,
          "class": "n02127052 lynx, catamount"
        },
        {
          "probability": 0.019691042602062225,
          "class": "n02129604 tiger, Panthera tigris"
        }
      ]

9. To clean up after the test, issue a delete command via the RESTful API and stop the model server:

   .. code:: bash

      curl -X DELETE http://127.0.0.1:8081/models/resnet-50_compiled
      multi-model-server --stop

================================================
FILE: archive/mxnet-neuron/tutorials/tutorials-mxnet-computervision.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Computer Vision Tutorials (``mxnet-neuron``)
============================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

* ResNet-50 tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] `
* Model Serving tutorial :ref:`[html] `
* Getting started with Gluon tutorial :ref:`[html] ` :github:`[notebook] `

================================================
FILE: archive/mxnet-neuron/tutorials/tutorials-mxnet-neuron.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Tutorials (``mxnet-neuron``)
=============================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1
   :hidden:

   Computer Vision Tutorials
   Natural Language Processing (NLP) Tutorials
   Utilizing Neuron Capabilities Tutorials

.. include:: /archive/mxnet-neuron/tutorials/tutorials-mxnet-neuron.txt

================================================
FILE: archive/mxnet-neuron/tutorials/tutorials-mxnet-neuron.txt
================================================

.. tab-set::

   .. tab-item:: Computer Vision Tutorials
      :name:

      * ResNet-50 tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] `
      * Model Serving tutorial :ref:`[html] `
      * Getting started with Gluon tutorial :ref:`[html] ` :github:`[notebook] `

   .. tab-item:: Natural Language Processing (NLP) Tutorials
      :name:

      * MXNet 1.8: Using data parallel mode tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] `

   .. tab-item:: Utilizing Neuron Capabilities Tutorials
      :name:

      * NeuronCore Groups tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] `

.. note::

   To use Jupyter Notebook see:

   * :ref:`setup-jupyter-notebook-steps-troubleshooting`
   * :ref:`running-jupyter-notebook-as-script`

================================================
FILE: archive/mxnet-neuron/tutorials/tutorials-mxnet-nlp.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Natural Language Processing (NLP) Tutorials (``mxnet-neuron``)
==============================================================

.. warning::
   This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only.
For current framework support, see :doc:`/frameworks/index`. * MXNet 1.8: Using data parallel mode tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] ` ================================================ FILE: archive/mxnet-neuron/tutorials/tutorials-mxnet-utilizing-neuron-capabilities.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Utilizing Neuron Capabilities Tutorials (``mxnet-neuron``) ========================================================== .. warning:: This document is archived. MXNet is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * NeuronCore Groups tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] ` ================================================ FILE: archive/neuronperf/index.rst ================================================ .. _neuronperf: .. meta:: :noindex: :nofollow: :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only. :date-modified: 12-02-2025 ================= NeuronPerf (Beta) ================= NeuronPerf is a lightweight Python library with a simple API that enables fast measurements of performance when running models using Neuron. .. _neuronperf_quickstart: NeuronPerf Quickstart --------------------- To install NeuronPerf in your Neuron environment, execute: .. code:: bash $ pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com Refer to the :ref:`neuronperf_examples` and :ref:`neuronperf_user_guide` to get started. .. _neuronperf_user_guide: NeuronPerf User Guide --------------------- .. toctree:: :maxdepth: 1 Overview Terminology Examples Benchmark Guide Evaluate Guide Compile Guide Model Index Guide NeuronPerf API Reference ------------------------ .. toctree:: :maxdepth: 1 API Framework Notes FAQ --- .. toctree:: :maxdepth: 1 FAQ Troubleshooting --------------- .. toctree:: :maxdepth: 1 Troubleshooting Release Notes ------------- .. toctree:: :maxdepth: 1 rn ================================================ FILE: archive/neuronperf/neuronperf_api.rst ================================================ .. _neuronperf_api: .. meta:: :noindex: :nofollow: :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only. :date-modified: 12-02-2025 NeuronPerf API ============== .. contents:: Table of Contents :local: :depth: 2 .. note:: Due to a bug in Sphinx, some of the type annotations may be incomplete. .. py:function:: compile(compile_fn, model, inputs, batch_sizes: Union[int, List[int]] = None, pipeline_sizes: Union[int, List[int]] = None, performance_levels: Union[str, List[int]] = None, models_dir: str = "models", filename: str = None, compiler_args: dict = None, verbosity: int = 1, *args, **kwargs) -> str: Compiles the provided model with each provided example input, pipeline size, and performance level. Any additional compiler_args passed will be forwarded to the compiler on every invocation. :param model: The model to compile. :param list inputs: A list of example inputs. :param batch_sizes: A list of batch sizes that correspond to the example inputs. :param pipeline_sizes: A list of pipeline sizes to use. See :ref:`neuroncore-pipeline`. :param performance_levels: A list of performance levels to try. Options are: 0 (max accuracy), 1, 2, 3 (max performance, default). 
See :ref:`neuron-cc-training-mixed-precision`.

   :param str models_dir: The directory where compilation artifacts will be stored.
   :param str model_name: An optional model name tag to apply to compiled artifacts.
   :param str filename: The name of the model index to write out. If not provided, a name will be generated and returned.
   :param dict compiler_args: Additional compiler arguments to be forwarded with every compilation.
   :param int verbosity: 0 = error, 1 = info, 2 = debug
   :return: A model index filename. If a configuration fails to compile, it will not be included in the index and an error will be logged.
   :rtype: str

.. _neuronperf_api_benchmark:

.. py:function:: benchmark(load_fn: Callable[[str, int], Any], model_filename: str, inputs: Any, batch_sizes: Union[int, List[int]] = None, duration: float = BENCHMARK_SECS, n_models: Union[int, List[int]] = None, pipeline_sizes: Union[int, List[int]] = None, cast_modes: Union[str, List[str]] = None, workers_per_model: Union[int, None] = None, env_setup_fn: Callable[[int, Dict], None] = None, setup_fn: Callable[[int, Dict, Any], None] = None, preprocess_fn: Callable[[Any], Any] = None, postprocess_fn: Callable[[Any], Any] = None, dataset_loader_fn: Callable[[Any, int], Any] = None, verbosity: int = 1, multiprocess: bool = True, multiinterpreter: bool = False, return_timers: bool = False, device_type: str = "neuron") -> List[Dict]:

   Benchmarks the model index or individual model using the provided inputs. If a model index is provided, additional fields such as ``pipeline_sizes`` and ``performance_levels`` can be used to filter the models to benchmark. The default behavior is to benchmark all configurations in the model index.

   :param load_fn: A function that accepts a model filename and device id, and returns a loaded model. This is automatically passed through the subpackage calls (e.g. ``neuronperf.torch.benchmark``).
   :param str model_filename: A path to a model index from compile or a path to an individual model. For CPU benchmarking, a class should be passed that can be instantiated with a default constructor (e.g. ``MyModelClass``).
   :param list inputs: A list of example inputs. If the list contains tuples, they will be destructured on inference to support multiple arguments.
   :param batch_sizes: A list of ints indicating batch sizes that correspond to the inputs. Assumes 1 if not provided.
   :param float duration: The number of seconds to benchmark each model.
   :param n_models: The number of models to run in parallel. Default behavior runs 1 model and the max number of models possible, determined by a best effort from ``device_type``, instance size, or other environment state.
   :param pipeline_sizes: A list of pipeline sizes to use. See :ref:`neuroncore-pipeline`.
   :param performance_levels: A list of performance levels to try. Options are: 0 (max accuracy), 1, 2, 3 (max performance, default). See :ref:`neuron-cc-training-mixed-precision`.
   :param workers_per_model: The number of workers to use per model loaded. If ``None``, this is automatically selected.
   :param env_setup_fn: A custom environment setup function to run in each subprocess before model loading. It will receive the benchmarker id and config.
   :param setup_fn: A function that receives the benchmarker id, config, and model to perform last-minute configuration before inference.
   :param preprocess_fn: A custom preprocessing function to perform on each input before inference.
   :param postprocess_fn: A custom postprocessing function to perform on each input after inference.
:param bool multiprocess: When True, model loading is dispatched to forked subprocesses. Should be left alone unless debugging. :param bool multiinterpreter: When True, benchmarking is performed in a new python interpreter per model. All parameters must be serializable. Overrides multiprocess. :param bool return_timers: When True, the return of this function is a list of tuples ``(config, results)`` with detailed information. This can be converted to reports with ``get_reports(results)``. :param float stats_interval: Collection interval (in seconds) for metrics during benchmarking, such as CPU and memory usage. :param str device_type: This will be set automatically to one of the ``SUPPORTED_DEVICE_TYPES``. :param float cost_per_hour: The price of this device / hour. Used to estimate cost / 1 million infs in reports. :param str model_name: A friendly name for the model to use in reports. :param str model_class_name: Internal use. :param str model_class_file: Internal use. :param int verbosity: 0 = error, 1 = info, 2 = debug :return: A list of benchmarking results. :rtype: list[dict] .. py:function:: get_reports(results) Summarizes and combines the detailed results from ``neuronperf.benchmark``, when run with ``return_timers=True``. One report dictionary is produced per model configuration benchmarked. The list of reports can be fed directly to other reporting utilities, such as ``neuronperf.write_csv``. :param list[tuple] results: The list of results from ``neuronperf.benchmark``. :param list[int] batch_sizes: The batch sizes that correspond to the `inputs` provided to ``compile`` and ``benchmark``. Used to correct throughput values in the reports. :return: A list of dictionaries that summarize the results for each model configuration. :rtype: list[dict] .. py:function:: print_reports(reports, cols=SUMMARY_COLS, sort_by="throughput_peak", reverse=False) Print a report to the terminal. Example of default behavior: >>> neuronperf.print_reports(reports) throughput_avg latency_ms_p50 latency_ms_p99 n_models pipeline_size workers_per_model batch_size model_filename 329.667 6.073 6.109 1 1 2 1 models/model_b1_p1_83bh3hhs.pt :param reports: Results from `get_reports`. :param cols: The columns in the report to be displayed. :param sort_by: Sort the cols by the specified key. :param reverse: Sort order. .. py:function:: write_csv(reports: list[dict], filename: str = None, cols=REPORT_COLS) Write benchmarking reports to CSV file. :param list[dict] reports: Results from `neuronperf.get_reports`. :param str filename: Filename to write. If not provided, generated from model_name in report and current timestamp. :param list[str] cols: The columns in the report to be kept. :return: The filename written. :rtype: str .. py:function:: write_json(reports: list[dict], filename: str = None) Writes benchmarking reports to a JSON file. :param list[dict] reports: Results from `neuronperf.get_reports`. :param str filename: Filename to write. If not provided, generated from model_name in report and current timestamp. :return: The filename written. :rtype: str .. py:function:: model_index.append(*model_indexes: Union[str, dict]) -> dict: Appends the model indexes non-destructively into a new model index, without modifying any of the internal data. This is useful if you have benchmarked multiple related models and wish to combine their respective model indexes into a single index. Model name will be taken from the first index provided. Duplicate configs will be filtered. 
   :param model_indexes: Model indexes or paths to model indexes to combine.
   :return: A new dictionary representing the combined model index.
   :rtype: dict

.. py:function:: model_index.copy(old_index: Union[str, dict], new_index: str, new_dir: str) -> str:

   Copy an index to a new location. Will rename ``old_index`` to ``new_index`` and copy all model files into ``new_dir``, updating the index paths. This is useful for pulling individual models out of a pool. Returns the path to the new index.

.. py:function:: model_index.create(filename, input_idx=0, batch_size=1, pipeline_size=1, cast_mode=DEFAULT_CAST, compile_s=None)

   Create a new model index from a pre-compiled model.

   :param str filename: The path to the compiled model.
   :param int input_idx: The index in your inputs that this model should be run on.
   :param int batch_size: The batch size at compilation for this model.
   :param int pipeline_size: The pipeline size used at compilation for this model.
   :param str cast_mode: The casting option this model was compiled with.
   :param float compile_s: Seconds spent compiling.
   :return: A new dictionary representing a model index.
   :rtype: dict

.. py:function:: model_index.delete(filename: str):

   Deletes the model index and all associated models referenced by the index.

.. py:function:: model_index.filter(index: Union[str, dict], **kwargs) -> dict:

   Filters the provided model index on the provided criteria and returns a new index. Each kwarg is a standard (k, v) pair, where k is treated as a filter name and v may be one or more values used to filter model configs.

.. py:function:: model_index.load(filename) -> dict:

   Load a NeuronPerf model index from a file.

.. py:function:: model_index.move(old_index: str, new_index: str, new_dir: str) -> str:

   This is the same as ``copy`` followed by ``delete`` on the old index.

.. py:function:: model_index.save(model_index, filename: str = None, root_dir=None) -> str:

   Save a NeuronPerf model index to a file.

================================================
FILE: archive/neuronperf/neuronperf_benchmark_guide.rst
================================================

.. _neuronperf_benchmark_guide:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

==========================
NeuronPerf Benchmark Guide
==========================

The call to ``neuronperf[torch/tensorflow/mxnet/cpu].benchmark`` is used to measure your model's performance. It will choose reasonable defaults if none are provided, and will return reports that summarize the benchmarking results.
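For orientation, a minimal end-to-end invocation might look like the following. This is a sketch, assuming PyTorch, a model already compiled to a file named ``model_neuron_b1.pt``, and a single example input; both names are illustrative:

.. code:: python

   import torch
   import neuronperf as npf
   import neuronperf.torch

   # One example input; batch_sizes tells NeuronPerf the batch dimension
   inputs = [torch.zeros((1, 3, 224, 224))]

   reports = npf.torch.benchmark('model_neuron_b1.pt', inputs, batch_sizes=[1])
   npf.print_reports(reports)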
What is the default behavior of ``benchmark``?
----------------------------------------------

That will depend on how you provided your model and how your model was compiled. The two most common ways to provide your model are:

#. Provide the path to your compiled model
#. Provide the path to a model index from ``neuronperf.compile`` (a JSON file)

Data Parallel
~~~~~~~~~~~~~

Your model is benchmarked on the provided ``inputs`` in 4 different configurations:

#. A single model on 1 NeuronCore with one worker (min. latency)
#. A single model on 1 NeuronCore with two workers (max. throughput / NC)
#. ``MAX`` models on ``MAX`` NeuronCores with one worker (min. latency + max. instance usage)
#. ``MAX`` models on ``MAX`` NeuronCores with two workers (max. throughput + max. instance usage)

The value ``MAX`` is automatically determined by your instance size. If it can't be identified, those configurations will be skipped. The primary benefit of (3) and (4) is to verify that your model scales well at maximum instance usage.

.. note::

   If you provided the path to a model index from ``compile``:

   * Your input parameters to ``benchmark`` (``batch_sizes``, etc.) are treated as filters on the index
   * Each remaining model configuration is benchmarked as described in (1)

Pipeline
~~~~~~~~

Pipeline mode is active when using a Neuron device and ``pipeline_sizes > 1``. The same behavior as described in Data Parallel applies, except that only one worker configuration is executed: the optimal number of workers for your pipeline size, unless manually overridden.

Parameters
----------

Below are some useful and common parameters to tweak. Please see the :ref:`neuronperf_api` for full details.

* ``n_models`` controls how many models to load. The default behavior is ``n_models=[1, MAX]``.
* ``workers_per_model`` controls how many worker threads will be feeding inputs to each model. The default is automatically determined.
* ``pipeline_sizes`` tells the benchmarker how many cores are needed for your model so that each model instance can be loaded properly. Default is 1.
* ``duration`` controls how long to run each configuration.
* ``batch_sizes`` is used to inform the benchmarker of your input shape so that throughput can be computed correctly.

Almost all NeuronPerf behaviors are controllable via arguments found in the :ref:`neuronperf_api`. This guide attempts to provide some context and examples for those arguments.

Inputs
------

Models accept one or more inputs to operate on. Since NeuronPerf needs to support multiple inputs for multiple models, as well as multi-input models, there are some details that may need your attention. See the :ref:`neuronperf_framework_notes` for details.

Multi-input Models
~~~~~~~~~~~~~~~~~~

If your model accepts multiple inputs, you must provide them in a ``tuple``. For example, suppose you have a model like this:

.. code:: python

   class Model(torch.nn.Module):
       def forward(self, x, y, z):
           ...
           return output

In order for NeuronPerf to pass along your multiple inputs correctly, you should provide them as a ``tuple``:

.. code:: python

   inputs = (x, y, z)
   npf.torch.benchmark(model_filename, inputs, ...)

If you are compiling and/or benchmarking multiple models, you can pass different sized inputs as a list of tuples:

.. code:: python

   inputs = [(x1, y1, z1), (x2, y2, z2), ...]
   npf.torch.benchmark(model_filename, inputs, ...)

Preprocessing and Postprocessing
--------------------------------

Many models have additional preprocessing and postprocessing steps involved that may add non-negligible overhead to inference time. NeuronPerf supports these use cases through the use of custom functions.

Preprocessing
~~~~~~~~~~~~~

Recall that NeuronPerf expects (or wraps) each model input into a ``tuple``. These tuples will be unpacked before calling your model. Here is an example for a model with one input. The example multiplies the input by 5 before inference.

.. code:: python

   def preprocess_fn(x):
       return x * 5

   ...

   # Benchmark with custom preprocessing function
   reports = npf.torch.benchmark(
       filename,
       inputs,
       ...,
       preprocess_fn = preprocess_fn,
   )

Or if your model expects multiple inputs:

.. code:: python

   def preprocess_fn(x, y, z):
       return x / 255, y / 255, z / 255

   ...

   # Benchmark with custom preprocessing function
   reports = npf.torch.benchmark(
       filename,
       inputs,
       ...,
       preprocess_fn = preprocess_fn,
   )
Postprocessing
~~~~~~~~~~~~~~

Postprocessing is almost identical to preprocessing, except that your function will receive whatever the output of your model is, exactly as returned, without modification. There are no type guarantees.

.. code:: python

   def postprocess_fn(x):
       return x.argmax()

   ...

   # Benchmark with custom postprocessing function
   reports = npf.torch.benchmark(
       filename,
       inputs,
       ...,
       postprocess_fn = postprocess_fn,
   )

Minimal Latency
---------------

Suppose you are interested in the minimal latency achievable with your model. In this case, there is no need for more than one worker to execute at a time. We can manually specify the number of workers to use; see :ref:`neuronperf_worker_threads` below, and the sketch that follows.
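Putting the two knobs together, a minimal-latency run pins both the number of model copies and the number of workers to 1. A sketch; the model filename and inputs are illustrative:

.. code:: python

   # One model copy, one worker thread: the minimal-latency configuration
   reports = npf.torch.benchmark(
       'model_neuron_b1.pt',
       inputs,
       n_models=1,
       workers_per_model=1,
   )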
.. _neuronperf_worker_threads:

Worker Threads
--------------

The argument ``workers_per_model`` controls the number of worker threads that are trying to prepare and load examples onto a single NeuronCore at a time. Therefore, a value of 1 corresponds to 1 thread / model. If ``n_models=16``, then there would be 16 worker threads, one per model. This number is selected based upon whether you are using DataParallel (i.e. ``pipeline_sizes == 1``) or Pipeline Mode (``pipeline_sizes != 1``).

By default, NeuronPerf will try multiple combinations of model copies and workers. You may be interested in controlling this manually.

.. code:: python

   reports = npf.torch.benchmark('model_neuron_b1.pt', ..., workers_per_model=1)

You may also pass a list, as with other parameters:

.. code:: python

   workers_per_model = [1, 2]  # Same as the default for data parallel
   reports = npf.torch.benchmark('model_neuron_b1.pt', ..., workers_per_model=workers_per_model)

With the default number of :ref:`neuronperf_model_copies`, a call to ``print_reports`` might look like this:

.. code:: bash

   throughput_avg latency_ms_p50 latency_ms_p99 n_models pipeline_size workers_per_model batch_size model_filename
   307.25         3.251          3.277          1        1             1                 1          models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
   2746.0         5.641          6.82           16       1             1                 1          models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
   329.5          6.053          6.108          1        1             2                 1          models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
   2809.0         10.246         12.52          16       1             2                 1          models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt

.. _neuronperf_model_copies:

Model Copies
------------

By default, NeuronPerf will benchmark two settings for ``n_models``:

1. A single copy
2. The maximum number of copies for your instance size

You can override this behavior by passing ``n_models`` to ``benchmark``, as shown below:

.. code:: python

   reports = npf.torch.benchmark('model_neuron_b1.pt', ..., n_models=6)

or

.. code:: python

   n_models = list(range(1, 10))
   reports = npf.torch.benchmark('model_neuron_b1.pt', ..., n_models=n_models)

.. _neuronperf_pipeline_mode:

Pipeline Mode
-------------

By default, NeuronPerf will assume you intend to use DataParallel, with two exceptions:

* You compiled your model using NeuronPerf for pipeline mode
* You constructed a model index that uses pipeline mode

You can also manually tell NeuronPerf that your model was compiled for pipeline mode. It is similar to how other arguments are passed.

.. code:: python

   reports = npf.torch.benchmark('model_neuron_b1.pt', ..., pipeline_sizes=2)

If you are passing multiple models in an index, then you should pass a list for ``pipeline_sizes``.

.. code:: python

   reports = npf.torch.benchmark('model_index.json', ..., pipeline_sizes=[1, 2, 3])

Duration
--------

NeuronPerf will benchmark each configuration specified for 60 seconds by default. You can control the duration by passing ``duration`` (in seconds).

.. code:: python

   reports = npf.torch.benchmark('model_index.json', ..., duration=10)

.. warning::

   If you make the duration too short, it may expire before all models are loaded and have had time to execute.

Custom Datasets (Beta)
----------------------

Currently, only PyTorch supports custom datasets, and the interface is subject to change. If you provide a custom dataset, it will be fully executed on each loaded model copy. So if you provide ``n_models=2``, your dataset will be run through twice in parallel.

To use this API, call ``benchmark`` passing a ``torch.utils.data.Dataset`` to ``inputs``. You can easily create your own ``Dataset`` by implementing the interface, or use one of the available datasets. For example:

.. code:: python

   import torchvision
   from torchvision.transforms import ToTensor

   dataset = torchvision.datasets.FashionMNIST(
       root="data", train=False, download=True, transform=ToTensor()
   )
   reports = npf.torch.benchmark('model_index.json', inputs=dataset, batch_sizes=[8],
                                 preprocess_fn=lambda x: x[0], loop_dataset=False)

.. note::

   The ``preprocess_fn`` is required here to extract the image input from the ``(image, label)`` tuple generated by the dataloader. If the dataset is too short to fill the benchmark duration, set ``loop_dataset=True`` to rerun the dataset until the duration elapses.

Results
-------

Viewing and Saving
~~~~~~~~~~~~~~~~~~

There are currently three ways to view results.

- ``neuronperf.print_reports(...)`` - Dump abbreviated results in your terminal
- ``neuronperf.write_csv(...)`` - Store metrics of interest as CSV
- ``neuronperf.write_json(...)`` - Store everything as JSON

See the :ref:`neuronperf_api` for full details.
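A typical post-benchmark flow chains these together. A sketch with illustrative filenames; ``write_csv`` and ``write_json`` both return the filename they wrote:

.. code:: python

   reports = npf.torch.benchmark('model_index.json', inputs)

   npf.print_reports(reports)           # quick summary in the terminal
   csv_file = npf.write_csv(reports)    # metrics of interest as CSV
   json_file = npf.write_json(reports)  # everything as JSON
   print("Saved:", csv_file, json_file)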
Results
-------

Viewing and Saving
~~~~~~~~~~~~~~~~~~

There are currently three ways to view results:

- ``neuronperf.print_reports(...)`` - Print abbreviated results to your terminal
- ``neuronperf.write_csv(...)`` - Store metrics of interest as CSV
- ``neuronperf.write_json(...)`` - Store everything as JSON

See the :ref:`neuronperf_api` for full details.

Full Timing Results
~~~~~~~~~~~~~~~~~~~

NeuronPerf automatically combines and summarizes the detailed timing information collected during benchmarking. If you wish to receive everything back yourself, you can use:

.. code:: python

    results = npf.torch.benchmark('model_index.json', ..., return_timers=True)

If you later wish to produce reports the same way that NeuronPerf does internally, you can call:

.. code:: python

    reports = npf.get_reports(results)
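If you prefer to post-process results yourself, each report can be treated as a dictionary of metrics. This is a minimal sketch; it assumes keys matching the columns that ``print_reports`` displays (the archived ``opt_benchmark.py`` sample later in this archive reads fields such as ``report["latency_ms_avg"]`` in the same way).

.. code:: python

    reports = npf.get_reports(results)

    # Keep only configurations that meet a latency budget, then rank by throughput.
    # The keys used here mirror the columns printed by npf.print_reports.
    fast_enough = [r for r in reports if r["latency_ms_p99"] < 10.0]
    for report in sorted(fast_enough, key=lambda r: r["throughput_avg"], reverse=True):
        print(report["n_models"], report["workers_per_model"], report["throughput_avg"])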
Verbosity
---------

Verbosity is an integer, currently one of ``{0, 1, 2}``, where:

* 0 = SILENT
* 1 = INFO (default)
* 2 = VERBOSE / DEBUG

Example:

.. code:: python

    reports = npf.torch.benchmark(..., n_models=1, duration=5, verbosity=2)

.. code:: bash

    DEBUG:neuronperf.benchmarking - Cast mode was not specified, assuming default.
    INFO:neuronperf.benchmarking - Benchmarking 'resnet50.json', ~5 seconds remaining.
    DEBUG:neuronperf.benchmarking - Running model config: {'model_filename': 'models/model_b1_p1_83bh3hhs.pt', 'device_type': 'neuron', 'input_idx': 0, 'batch_size': 1, 'n_models': 1, 'workers_per_model': 2, 'pipeline_size': 1, 'cast_mode': None, 'multiprocess': True, 'multiinterpreter': False, 'start_dts': '20211111-062818', 'duration': '5'}
    DEBUG:neuronperf.benchmarking - Benchmarker 0 started.
    DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 0 started.
    DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 1 started.
    DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 0 finished after 738 inferences.
    DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 1 finished after 738 inferences.
    DEBUG:neuronperf.benchmarking - Benchmarker 0 finished.

    throughput_avg latency_ms_p50 latency_ms_p99 n_models pipeline_size workers_per_model batch_size model_filename
    329.667        6.073          6.109          1        1             2                 1          models/model_b1_p1_83bh3hhs.pt

Internal Process Model
----------------------

For each model loaded (see :ref:`neuronperf_model_copies`), a process is spawned. Each process may use multiple threads (see :ref:`neuronperf_worker_threads`), which continuously load examples to keep the hardware busy.

NeuronPerf spawns processes slightly differently between frameworks. For PyTorch and Apache MXNet, processes are forked. For TensorFlow/Keras, a fresh interpreter is launched, and benchmarkers are serialized and run as a script. If you suspect you are having trouble due to the way processes are managed, you have two mechanisms of control:

.. code:: python

    reports = npf.torch.benchmark(..., multiprocess=False)

The default is ``True``; ``False`` disables multiprocessing and runs everything inside a single parent process. This may not work for all frameworks beyond the first model configuration, because process teardown is used to safely deallocate models from the hardware. Benchmarking this way is not recommended.

.. code:: python

    reports = npf.torch.benchmark(..., multiinterpreter=True)

This flag controls whether a fresh interpreter is used instead of forking. It defaults to ``False``, except with TensorFlow/Keras.

.. _npf-cpu-gpu:

Benchmark on CPU or GPU
-----------------------

When benchmarking on CPU or GPU, the API is slightly different. With CPU or GPU, there is no compiled model to benchmark, so instead we need to directly pass a reference to the model class that will be instantiated.

.. note::

    GPU benchmarking is currently only available for PyTorch.

CPU:

.. code:: python

    cpu_reports = npf.cpu.benchmark(YourModelClass, ...)

GPU:

.. code:: python

    gpu_reports = npf.torch.benchmark(YourModelClass, ..., device_type="gpu")

Your model class will be instantiated in a subprocess, so there are some things to keep in mind:

* Your model class must be defined at the top level inside a Python module (i.e. don't place your model class definition inside a function or other nested scope)
* If your model class has special Python module dependencies, consider importing them inside your class ``__init__``
* If your model class expects constructor arguments, wrap your class so that it has no constructor arguments

Example of a wrapped model class for CPU/GPU benchmarking:

.. code:: python

    class ModelWrapper(torch.nn.Module):
        def __init__(self):
            super().__init__()
            from transformers import AutoModelForSequenceClassification

            model_name = "bert-base-cased"
            self.bert = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
            self.add_module(model_name, self.bert)

        def forward(self, *inputs):
            return self.bert(*inputs)

    reports = npf.torch.benchmark(ModelWrapper, inputs, device_type="gpu")
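For completeness, here is a sketch of constructing ``inputs`` for the wrapped BERT model above. It mirrors the ``get_batch`` helpers in the archived benchmark scripts later in this archive; the sequence length and sentences are illustrative, and since a ``tuple`` input is destructured to ``model(*input)``, the tuple below maps onto ``ModelWrapper.forward``.

.. code:: python

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    encoded = tokenizer.encode_plus(
        "The company HuggingFace is based in New York City",
        max_length=128,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )

    # One complete example: (input_ids, attention_mask), destructured by NeuronPerf.
    inputs = (encoded["input_ids"], encoded["attention_mask"])
    reports = npf.torch.benchmark(ModelWrapper, [inputs], device_type="gpu")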
================================================
FILE: archive/neuronperf/neuronperf_compile_guide.rst
================================================

.. _neuronperf_compile_guide:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

========================
NeuronPerf Compile Guide
========================

If you wish to compile multiple configurations at once, NeuronPerf provides a simplified and uniform API across frameworks. The output is a model index that tracks the artifacts produced and can be passed directly to the :ref:`benchmark <neuronperf_benchmark_guide>` routine for a streamlined end-to-end process. This may be useful if you wish to test multiple configurations of your model on Neuron hardware.

You can manually specify the model index filename by passing ``filename``, or let NeuronPerf generate one and return it for you. Compiled artifacts will be placed in a local ``models`` directory.

How does ``compile`` know which instance type to compile for?
--------------------------------------------------------------

NeuronPerf assumes that the instance type you are currently on is also the compile target. However, you may compile on a non-Neuron instance, or choose to target a different instance type. In that case, you can pass ``compiler_target`` to the ``compile`` call. For example:

.. code:: python

    import neuronperf as npf
    import neuronperf.torch

    npf.torch.compile(model, inputs)                          # compile for current instance type
    npf.torch.compile(model, inputs, compiler_target="inf2")  # compile for inf2

Compiling multiple variants
---------------------------

If you provide multiple pipeline sizes, batch sizes, and/or cast modes, NeuronPerf will compile all of them.

.. code:: python

    # Select a few batch sizes and pipeline configurations to test
    batch_sizes = [1, 5, 10]
    pipeline_sizes = [1, 2, 4]

    # Construct example inputs
    example_inputs = [torch.zeros([batch_size, 3, 224, 224], dtype=torch.float16) for batch_size in batch_sizes]

    # Compile all configurations
    index = npf.torch.compile(
        model,
        example_inputs,
        batch_sizes=batch_sizes,
        pipeline_sizes=pipeline_sizes,
    )

If you wish to benchmark specific subsets of configurations, you can compile those configurations independently and later combine the results into a single index, as shown below.

.. code:: python

    # Compile with pipeline size 1 and vary the batch dimension
    batch_index = npf.torch.compile(
        model,
        example_inputs,
        batch_sizes=batch_sizes,
        pipeline_sizes=1,
    )

    # Compile with batch size 1 and vary the pipeline dimension
    pipeline_index = npf.torch.compile(
        model,
        example_inputs[0],
        batch_sizes=1,
        pipeline_sizes=pipeline_sizes,
    )

    index = npf.model_index.append(batch_index, pipeline_index)
    npf.model_index.save(index, 'model_index.json')

The ``compile`` function supports ``batch_sizes``, ``pipeline_sizes``, ``cast_modes``, and custom ``compiler_args``. If there is an error during compilation for a requested configuration, it will be logged and compilation will continue without terminating. (This is to support long-running compile jobs with many configurations.)
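To close the loop, the saved index can be handed straight to ``benchmark``. This minimal sketch reuses ``example_inputs`` and the index filename from the example above; as described in the model index guide, the inputs must occupy the same positions as at compile time.

.. code:: python

    # All configurations recorded in the index are benchmarked in one call.
    reports = npf.torch.benchmark('model_index.json', example_inputs)
    npf.print_reports(reports)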
================================================
FILE: archive/neuronperf/neuronperf_evaluate_guide.rst
================================================

.. _neuronperf_evaluate_guide:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

=========================
NeuronPerf Evaluate Guide
=========================

NeuronPerf has a new API for evaluating model accuracy on Neuron hardware. This API is currently only available for PyTorch.

You can access the API through the standard ``benchmark()`` call by passing an additional kwarg, ``eval_metrics``. For example:

.. code:: python

    reports = npf.torch.benchmark(
        model_index_or_path,
        dataset,
        n_models=1,
        workers_per_model=2,
        duration=0,
        eval_metrics=['accuracy', 'precision'],
    )

In this example, we fix ``n_models`` and ``workers_per_model`` because replicating the same model will not impact accuracy. We also set ``duration=0`` to allow benchmarking to run untimed through all dataset examples.

Because this call can be tedious to type, a convenience function is provided:

.. code:: python

    reports = npf.torch.evaluate(model_index_or_path, dataset, metrics=['accuracy', 'precision'])

.. note::

    ``eval_metrics`` becomes ``metrics`` when using ``evaluate``.

The ``dataset`` can be any iterable object that produces ``tuple(*INPUTS, TARGET)``. If ``TARGET`` does not appear in the last column of your dataset, you can customize this by passing ``eval_target_col``. For example:

.. code:: python

    reports = npf.torch.evaluate(model_index_or_path, dataset, metrics='accuracy', eval_target_col=1)

You can list the currently available metrics:

.. code:: python

    >>> npf.list_metrics()
    Name                      Description
    Accuracy                  (TP + TN) / (TP + TN + FP + FN)
    TruePositiveRate          TP / (TP + FN)
    Sensitivity               Alias for TruePositiveRate
    Recall                    Alias for TruePositiveRate
    Hit Rate                  Alias for TruePositiveRate
    TrueNegativeRate          TN / (TN + FP)
    Specificity               Alias for TrueNegativeRate
    Selectivity               Alias for TrueNegativeRate
    PositivePredictiveValue   TP / (TP + FP)
    Precision                 Alias for PositivePredictiveValue
    NegativePredictiveValue   TN / (TN + FN)
    FalseNegativeRate         FN / (FN + TP)
    FalsePositiveRate         FP / (FP + TN)
    FalseDiscoveryRate        FP / (FP + TP)
    FalseOmissionRate         FN / (FN + TN)
    PositiveLikelihoodRatio   TPR / FPR
    NegativeLikelihoodRatio   FNR / TNR
    PrevalenceThreshold       sqrt(FPR) / (sqrt(FPR) + sqrt(TPR))
    ThreatScore               TP / (TP + FN + FP)
    F1Score                   2TP / (2TP + FN + FP)
    MeanAbsoluteError         sum(|y - x|) / n
    MeanSquaredError          sum((y - x)^2) / n

New metrics may appear in the list after importing a submodule. For example, ``import neuronperf.torch`` will register a new ``topk`` metric.

Custom Metrics
--------------

Simple Variants
===============

If you wish to register a metric that is a slight tweak of an existing metric with different ``init`` args, you can use ``register_metric_from_existing()``:

.. code:: python

    npf.register_metric_from_existing("topk", "topk_3", k=3)

This example registers a new metric ``topk_3`` from the existing metric ``topk``, passing ``k=3`` at ``init`` time.

New Metrics
===========

You can register your own metrics using ``register_metric()``. Your metrics must extend ``BaseEvalMetric``:

.. code:: python

    class BaseEvalMetric(ABC):
        """
        Abstract base class BaseEvalMetric from which other metrics inherit.
        """

        @abstractmethod
        def process_record(self, output: Any = None, target: Any = None) -> None:
            """Process an individual record and return the result."""
            pass

        @staticmethod
        def aggregate(metrics: Iterable["BaseEvalMetric"]) -> Any:
            """Combine a sequence of metrics into a single result."""
            raise NotImplementedError

For example:

.. code:: python

    import neuronperf as npf

    class MyCustomMetric(npf.BaseEvalMetric):
        def __init__(self):
            super().__init__()
            self.passing = 0
            self.processed = 0

        def process_record(self, outputs, target):
            self.processed += 1
            if outputs == target:
                self.passing += 1

        @staticmethod
        def aggregate(metrics):
            passing = 0
            processed = 0
            for metric in metrics:
                passing += metric.passing
                processed += metric.processed
            return passing / processed if processed else 0

    npf.register_metric("MyCustomMetric", MyCustomMetric)
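Once registered, the metric should be usable by name alongside the built-ins; this is a sketch under the assumption that the name passed to ``register_metric`` is accepted in the ``metrics`` list, just as the string names above are.

.. code:: python

    # Evaluate using the custom metric alongside a built-in one.
    reports = npf.torch.evaluate(
        model_index_or_path,
        dataset,
        metrics=['accuracy', 'MyCustomMetric'],
    )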
================================================
FILE: archive/neuronperf/neuronperf_examples.rst
================================================

.. _neuronperf_examples:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

NeuronPerf Examples
===================

This page walks through several examples of using NeuronPerf, starting with the simplest case: benchmarking a model that has already been compiled. We will also see how to use NeuronPerf to perform a hyperparameter search and to manage both the artifacts produced and our results.

Benchmark a Compiled Model
--------------------------

This example assumes you have already compiled your model for Neuron and saved it to disk. You will need to adapt the batch size, input shape, and filename for your model.

.. code:: python

    import torch  # or tensorflow, mxnet

    import neuronperf as npf
    import neuronperf.torch  # or tensorflow, mxnet

    # Construct dummy inputs
    batch_sizes = 1
    input_shape = (batch_sizes, 3, 224, 224)
    inputs = torch.ones(input_shape)  # or numpy array for TF, MX

    # Benchmark and save results
    reports = npf.torch.benchmark("your_model_file.pt", inputs, batch_sizes)
    npf.print_reports(reports)
    npf.write_json(reports)

.. code:: bash

    INFO:neuronperf.benchmarking - Benchmarking 'your_model_file.pt', ~8.0 minutes remaining.
    throughput_avg latency_ms_p50 latency_ms_p99 n_models pipeline_size workers_per_model batch_size model_filename
    296766.5       0.003          0.003          1        1             1                 1          your_model_file.pt
    3616109.75     0.005          0.008          24       1             1                 1          your_model_file.pt
    56801.0        0.035          0.04           1        1             2                 1          your_model_file.pt
    3094419.4      0.005          0.051          24       1             2                 1          your_model_file.pt

Let's suppose you only wish to test two specific configurations: 1 model with 1 worker thread, and 1 model with 2 worker threads, benchmarking each for 15 seconds. The call to ``benchmark`` becomes:

.. code:: python

    reports = npf.torch.benchmark(filename, inputs, batch_sizes, n_models=1, workers_per_model=[1, 2], duration=15)

You can also add a custom model name to reports:

.. code:: python

    reports = npf.torch.benchmark(..., model_name="MyFancyModel")

See the :ref:`neuronperf_benchmark_guide` for further details.

Benchmark a Model from Source
-----------------------------

In this example, we define, compile, and benchmark a simple (dummy) model using PyTorch. The script traces the model with a batch size of 1 and an input shape of (3, 224, 224), saves it as ``model_neuron_b1.pt``, and then benchmarks it.

.. literalinclude:: test_simple_pt.py
   :language: python
   :caption: :download:`test_simple_pt.py <test_simple_pt.py>`
   :linenos:

.. code:: bash

    (aws_neuron_pytorch_p36) ubuntu@ip-172-31-11-122:~/tmp$ python test_simple_pt.py
    INFO:neuronperf.benchmarking - Benchmarking 'model_neuron_b1.pt', ~8.0 minutes remaining.
    throughput_avg latency_ms_p50 latency_ms_p99 n_models pipeline_size workers_per_model batch_size model_filename
    296766.5       0.003          0.003          1        1             1                 1          model_neuron_b1.pt
    3616109.75     0.005          0.008          24       1             1                 1          model_neuron_b1.pt
    56801.0        0.035          0.04           1        1             2                 1          model_neuron_b1.pt
    3094419.4      0.005          0.051          24       1             2                 1          model_neuron_b1.pt

Compile and Benchmark a Model
-----------------------------

Here is an end-to-end example of compiling and benchmarking a ResNet-50 model from ``torchvision``.

.. literalinclude:: test_resnet50_pt.py
   :language: python
   :caption: :download:`test_resnet50_pt.py <test_resnet50_pt.py>`
   :linenos:

Benchmark on CPU or GPU
-----------------------

When benchmarking on CPU or GPU, the API is slightly different. With CPU or GPU, there is no compiled model to benchmark, so instead we need to directly pass a reference to the model class that will be instantiated.

.. note::

    GPU benchmarking is currently only available for PyTorch.

CPU:

.. code:: python

    cpu_reports = npf.cpu.benchmark(YourModelClass, ...)

GPU:

.. code:: python

    gpu_reports = npf.torch.benchmark(YourModelClass, ..., device_type="gpu")

Please refer to :ref:`npf-cpu-gpu` for details and an example of providing your model class.
================================================
FILE: archive/neuronperf/neuronperf_faq.rst
================================================

.. _neuronperf_faq:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

NeuronPerf FAQ
==============

.. contents:: Table of contents
   :local:
   :depth: 1

When should I use NeuronPerf?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When you want to measure the highest achievable performance for your model with Neuron.

When should I **not** use NeuronPerf?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When measuring end-to-end performance that includes your network serving stack. Instead, you should compare your e2e numbers to those obtained by NeuronPerf to optimize your serving overhead.

Which frameworks does NeuronPerf support?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See :ref:`neuronperf_framework_notes`.

Which Neuron instance types does NeuronPerf support?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch and TensorFlow support all instance types. MXNet support is limited to inf1.

What is the secret to obtaining the best numbers?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is no secret sauce. NeuronPerf follows best practices.

What are the "best practices" that NeuronPerf uses?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- These vary slightly by framework and by how your model was compiled
- For a model compiled for a single NeuronCore (DataParallel):

  - To maximize throughput, for ``N`` models, use ``2 * N`` worker threads
  - To minimize latency, use 1 worker thread per model

- Use a new Python process for each model to avoid GIL contention
- Ensure you benchmark long enough for your numbers to stabilize
- Ignore outliers at the start and end of inference benchmarking

The sketch after this list shows what the first two practices look like as ``benchmark`` arguments.
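A minimal sketch of the throughput and latency recipes above, assuming a model compiled for a single NeuronCore, ``N`` copies to load, and that ``inputs`` is already defined (the filename is illustrative):

.. code:: python

    import neuronperf as npf
    import neuronperf.torch

    N = 4  # number of model copies to load

    # Throughput recipe: 2 worker threads per model keeps every copy saturated.
    throughput_reports = npf.torch.benchmark(
        'model_neuron_b1.pt', inputs, n_models=N, workers_per_model=2)

    # Latency recipe: 1 worker thread per model avoids queueing delay.
    latency_reports = npf.torch.benchmark(
        'model_neuron_b1.pt', inputs, n_models=N, workers_per_model=1)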
================================================
FILE: archive/neuronperf/neuronperf_framework_notes.rst
================================================

.. _neuronperf_framework_notes:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

==========================
NeuronPerf Framework Notes
==========================

PyTorch
=======

* Requires: ``torch-neuron`` or ``torch-neuronx``

  - Versions: 1.7.x, 1.8.x, 1.9.x, 1.10.x, 1.11.x, 1.12.x, 1.13.x

* Input to ``compile``: ``torch.nn.Module``
* Model inputs: ``Any``

TensorFlow 1.x
==============

* Requires: ``tensorflow-neuron``

  - Versions: All

* Input to ``compile``: Path to uncompiled model dir from ``saved_model.simple_save``
* Model inputs: Tensors must be provided as ``numpy.ndarray``

.. note::

    Although TensorFlow *tensors* must be ``ndarray``, this doesn't stop you from wrapping them inside of data structures that traverse process boundaries safely. For example, you can still pass an input ``dict`` such as ``{'input_0': np.zeros((2, 1))}``.

TensorFlow 2.x
==============

* Requires: ``tensorflow-neuron`` or ``tensorflow-neuronx``

  - Versions: All

* Input to ``compile``: ``tf.keras.Model``
* Model inputs: Tensors must be provided as ``numpy.ndarray``

.. note::

    Although TensorFlow *tensors* must be ``ndarray``, this doesn't stop you from wrapping them inside of data structures that traverse process boundaries safely. For example, you can still pass an input ``dict`` such as ``{'input_0': np.zeros((2, 1))}``.

Apache MXNet
============

* Requires: ``mxnet-neuron``

  - Versions: 1.5, 1.8

* Input to ``compile``: ``tuple(sym, args, aux)``
* Model inputs: Tensors must be provided as ``mxnet.ndarray`` or ``numpy.ndarray``

================================================
FILE: archive/neuronperf/neuronperf_install.rst
================================================

.. _neuronperf_install:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

NeuronPerf Install
==================

Activate your Neuron environment, and execute:

.. code:: bash

    $ pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com
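To sanity-check the installation, you can import the package from the same environment. This assumes the package exposes a ``__version__`` attribute (the archived ``setup.py`` later in this archive writes one into ``src/neuronperf/__version__.py``, so this is a reasonable but unverified expectation):

.. code:: bash

    $ python -c "import neuronperf; print(neuronperf.__version__)"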
================================================
FILE: archive/neuronperf/neuronperf_model_index_guide.rst
================================================

.. _neuronperf_model_index_guide:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

============================
NeuronPerf Model Index Guide
============================

A **model index** is a JSON file that tracks information about one or more compiled models. You can generate one using ``compile``, by using the API described here, or by creating it manually in a text editor.

After a call to ``compile``, you may notice that you now have a ``models`` directory. You will also spot a new file named something like ``model_83b3raj2.json`` in your local directory, if you didn't provide a ``filename`` yourself.

A model index is not intended to be opaque; you should feel free to open, inspect, and modify it yourself. It contains some information about the artifacts that were compiled. Individual models referenced by the index can be handed to ``benchmark`` directly along with an example input, or you may pass the entire index as in the basic example above. Here is an example index:

.. code:: bash

    python3 -m json.tool model_index.json

.. code:: json

    {
        "version": "0.0.0.0+0bc220a",
        "model_configs": [
            {
                "filename": "models/model_b1_p1_38793jda.pt",
                "input_idx": 0,
                "batch_size": 1,
                "pipeline_size": 1,
                "compile_s": 5.32
            }
        ]
    }

An index is useful for keeping track of your compiled artifacts and their parameters. The advantages of using ``neuronperf.[torch/tensorflow/mxnet].compile`` are clearer when we wish to compile multiple variants of our model and benchmark all of them at the same time. All of the model artifacts and the index can be destroyed using ``model_index.delete('model_index.json')``.

Benchmarking
============

When benchmarking with an index, there are some important details to keep in mind. If you originally built the index using a set of inputs, the model index has associated the ``inputs`` with the compiled models by their positional index. For example:

.. code:: python

    batch_sizes = [1, 2]
    inputs = [torch.zeros((b, 100)) for b in batch_sizes]

Here, ``inputs[0]`` corresponds to batch size 1. Therefore, the model index will contain a reference to input 0 for that model. When you call ``benchmark``, you must pass inputs with the same shapes in the same positions as at compile time.

.. note::

    It's only necessary that there is an input with the correct shape at ``inputs[input_index]``. The example data itself is not important.

Working with Indexes
--------------------

The API details below describe utilities for working with indexes. An ``index`` can be either a loaded index (JSON) or the path to an index (it will be loaded automatically).

Creating
========

.. code:: python

    index = neuronperf.model_index.create('/path/to/model', batch_size=1)
    filename = neuronperf.model_index.save(index)

Once you have an index, you can pass its path directly to ``benchmark``. You can also pass a custom filename instead:

.. code:: python

    index = neuronperf.model_index.create('/path/to/model', batch_size=1)
    neuronperf.model_index.save(index, 'my_index.json')

Appending
=========

If **multiple models use the same inputs**, you can append them together. For example, if you have the same batch size with multiple pipeline sizes, the inputs are the same, but the model changes.

.. code:: python

    pipeline_sizes = [1, 2, 3, 4]
    indexes = [neuronperf.model_index.create(f'/path/to/model_p{p}', pipeline_size=p, batch_size=5) for p in pipeline_sizes]
    index = neuronperf.model_index.append(*indexes)
    neuronperf.model_index.save(index, 'my_index.json')

Filtering
=========

You can construct a new model index that is filtered by some parameter. For example, to get a new index with only batch sizes [1, 2], you could do:

.. code:: python

    new_index = neuronperf.model_index.filter(index, batch_sizes=[1, 2])

You can also benchmark a subset of a model index by passing only the subset parameters of interest, but remember to provide the correct number of inputs for the index (even if some are not used). For example, if you have an index with models at ``batch_sizes = [1, 2, 3]`` but only wish to benchmark batch size 2:

.. code:: python

    batch_sizes = [1, 2, 3]
    inputs = [torch.zeros((b, 100)) for b in batch_sizes]
    reports = neuronperf.torch.benchmark('model_index.json', inputs, batch_sizes=2)

Copying
=======

You can copy an index to a new location with ``neuronperf.model_index.copy(index, new_index_name, new_index_dir)``. This is mostly useful in combination with ``filter``/``append``; see the sketch after this guide.

Deleting
========

If you wish to keep your compiled models, just delete the model index file yourself. If you want to delete your model index and all associated artifacts, use:

.. code:: python

    neuronperf.model_index.delete('my_index.json')
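As an illustration of combining these utilities, here is a hypothetical workflow that carves a small-batch index out of a larger one and copies it, under a fresh name, into its own directory. The filenames and directory are illustrative; the calls follow the ``filter`` and ``copy`` signatures described above.

.. code:: python

    import neuronperf as npf

    # Keep only the small-batch configurations ...
    small = npf.model_index.filter('model_index.json', batch_sizes=[1, 2])

    # ... and copy the filtered index, plus its artifacts' references,
    # into a separate location for targeted benchmarking.
    npf.model_index.copy(small, 'small_batch_index.json', 'small_batch_models')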
================================================
FILE: archive/neuronperf/neuronperf_overview.rst
================================================

.. _neuronperf_overview:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

===================
NeuronPerf Overview
===================

NeuronPerf is a lightweight Python library that helps you easily benchmark your models on Neuron hardware. NeuronPerf supports Neuron releases for PyTorch, TensorFlow, and MXNet. It is used internally by the Neuron team to generate performance benchmarking numbers.

When interacting with NeuronPerf, you will typically import the base package along with one of the submodule wrappers, for example:

.. code:: python

    import neuronperf
    import neuronperf.torch

You may then benchmark and/or compile one or more models with NeuronPerf. For example:

.. code:: python

    reports = neuronperf.torch.benchmark(model, inputs, ...)

The ``compile`` and ``benchmark`` methods must be accessed through one of the supported framework submodules.

Benchmarking
============

All NeuronPerf ``benchmark`` calls require a minimum of two arguments:

1. A filename
2. Inputs

The filename may refer to:

1. A Neuron-compiled model (e.g. ``my_model.pt``)
2. A :ref:`Model Index <neuronperf_model_index_guide>`

A Model Index is useful for benchmarking more than one model in a single session.

Compiling
=========

NeuronPerf also provides a standard interface to all Neuron frameworks through the ``compile`` API.

.. code:: python

    model_index = neuronperf.torch.compile(model, inputs, ...)

This is completely optional. You may instead use the standard compilation guides for supported frameworks.

Next Steps
==========

Take a look at the simple :ref:`neuronperf_examples`, :ref:`neuronperf_benchmark_guide`, :ref:`neuronperf_compile_guide`, and :ref:`neuronperf_api`.

================================================
FILE: archive/neuronperf/neuronperf_terminology.rst
================================================

.. _neuronperf_terminology:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

NeuronPerf Terminology
======================

* Inputs

  - An individual input or ``list`` of inputs
  - Example: ``inputs = [(torch.ones((batch_size, 5))) for batch_size in batch_sizes]``
  - Each input is associated with the ``batch_sizes`` specified, in the same order
  - Each input is fed individually to a corresponding model
  - If an input is provided as a ``tuple``, it will be destructured to ``model(*input)`` to support multiple args
  - See :ref:`neuronperf_framework_notes` for framework-specific requirements

* Latency

  - Time to execute a single ``model(input)``
  - Typically measured in milliseconds

* Model

  - Your data model; varies by framework. See :ref:`neuronperf_framework_notes`
  - Models may be wrapped by submodules (``torch``, ``tensorflow``, ``mxnet``) as callables

* Model Index

  - A JSON file that tracks compiled model artifacts

* Model Input

  - A ``tuple`` of inputs passed to a model, i.e. a single complete example
  - Example: ``input = (torch.ones((5, 3, 224, 224)),)``

* Throughput

  - Inferences / second

================================================
FILE: archive/neuronperf/neuronperf_troubleshooting.rst
================================================

.. _neuronperf_troubleshooting:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

NeuronPerf Troubleshooting
==========================

.. contents:: Table of contents
   :local:
   :depth: 2

Compilation issues
^^^^^^^^^^^^^^^^^^

Model fails to compile
~~~~~~~~~~~~~~~~~~~~~~

Please `file a bug <https://github.com/aws/aws-neuron-sdk/issues>`_ with as much information as possible.

Benchmarking Issues
^^^^^^^^^^^^^^^^^^^

Benchmarking terminates early with errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Scroll up and read the output. The most likely causes are:

  - invalid input shapes, or
  - not enough memory to load the requested number of model copies on the device. Try passing ``n_models=1`` to ``benchmark`` again to test for memory issues.

Other Issues or Feature Requests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please file a bug on `GitHub <https://github.com/aws/aws-neuron-sdk/issues>`_.

================================================
FILE: archive/neuronperf/rn.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

What's New
==========

.. toctree::
   :maxdepth: 1

   /release-notes/components/dev-tools

================================================
FILE: archive/neuronperf/setup.cfg
================================================

[aliases]
# Define this so we don't resolve to the wrong setuptools 'test' entrypoint when
# invoking brazil-build test.
test = brazil_test

================================================
FILE: archive/neuronperf/setup.py
================================================

import collections
import os
import subprocess

from setuptools import find_packages, setup

# Read __version__.py
version_py = os.path.join("src", "neuronperf", "__version__.py")
with open(version_py, "rt") as fp:
    lines = fp.readlines()

meta = collections.OrderedDict()
for line in lines:
    key, value = line.split("=")
    meta[key.strip()] = value.strip()[1:-1]

# Extract fields for packaging
TITLE = meta["__title__"]
AUTHOR = meta["__author__"]
DESCRIPTION = meta["__description__"]
VERSION = os.getenv("BRAZIL_PACKAGE_VERSION", "0.0.0.0")
LICENSE = meta["__license__"]

# Compute release version and write back meta info for consistency.
GIT_SHA = os.environ.get("BRAZIL_PACKAGE_CHANGE_ID")
if GIT_SHA:
    GIT_SHA = GIT_SHA.strip()[:9]
else:
    # This is probably a local build. Try to attach something meaningful.
try: GIT_SHA = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip() except: GIT_SHA = "0" * 9 VERSION = "{}+{}".format(VERSION.strip(), GIT_SHA) meta["__version__"] = VERSION with open(version_py, "wt") as fp: for k, v in meta.items(): fp.write('{} = "{}"\n'.format(k, v)) setup( name=TITLE, version=VERSION, description=DESCRIPTION, author=AUTHOR, license=LICENSE, classifiers=[ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Topic :: Scientific/Engineering :: Artificial Intelligence", "License :: Other/Proprietary License", "Programming Language :: Python :: 3.6", ], keywords="aws neuron", packages=find_packages(where="src", exclude=("test",)), install_requires=["dill==0.3.4", "numpy", "psutil==5.9.0"], python_requires=">=3.6", package_dir={"": "src"}, data_files=[], package_data={"": ["py.typed"]}, ) ================================================ FILE: archive/neuronperf/test_resnet50_pt.py ================================================ import torch import torch_neuron import neuronperf as npf import neuronperf.torch from torchvision import models # Load a pretrained ResNet50 model model = models.resnet50(pretrained=True) # Select a few batch sizes to test filename = 'resnet50.json' batch_sizes = [5, 6, 7] # Construct example inputs inputs = [torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32) for batch_size in batch_sizes] # Compile npf.torch.compile( model, inputs, batch_sizes=batch_sizes, filename=filename, ) # Benchmark reports = npf.torch.benchmark(filename, inputs) # View and save results npf.print_reports(reports) npf.write_csv(reports, 'resnet50_results.csv') npf.write_json(reports, 'resnet50_results.json') ================================================ FILE: archive/neuronperf/test_simple_pt.py ================================================ import torch import torch.neuron import neuronperf as npf import neuronperf.torch # Define a simple model class Model(torch.nn.Module): def forward(self, x): x = x * 3 return x + 1 # Instantiate model = Model() model.eval() # Define some inputs batch_sizes = [1] inputs = [torch.ones((batch_size, 3, 224, 224)) for batch_size in batch_sizes] # Compile for Neuron model_neuron = torch.neuron.trace(model, inputs) model_neuron.save("model_neuron_b1.pt") # Benchmark reports = npf.torch.benchmark("model_neuron_b1.pt", inputs, batch_sizes) # View and save results npf.print_reports(reports) npf.write_csv(reports, "model_neuron_b1.csv") ================================================ FILE: archive/src/benchmark/pytorch/bert-base-cased_benchmark.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["bert-base-cased"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = 
AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = neuronperf.torch.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/bert-base-cased_compile.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["bert-base-cased"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) neuronperf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/bert-base-uncased_benchmark.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["bert-base-uncased"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = neuronperf.torch.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) 
neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/bert-base-uncased_compile.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["bert-base-uncased"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) neuronperf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/distilbert-base-uncased-finetuned-sst-2-english_benchmark.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilbert-base-uncased-finetuned-sst-2-english"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = neuronperf.torch.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/distilbert-base-uncased-finetuned-sst-2-english_compile.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import 
AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilbert-base-uncased-finetuned-sst-2-english"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) neuronperf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/distilbert-base-uncased_benchmark.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilbert-base-uncased"] sequence_lengths = [128] batch_sizes = [9] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = neuronperf.torch.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/distilbert-base-uncased_compile.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilbert-base-uncased"] sequence_lengths = [128] batch_sizes = [9] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = 
tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) neuronperf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/distilroberta-base_benchmark.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilroberta-base"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = neuronperf.torch.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/distilroberta-base_compile.py ================================================ import torch import torch.neuron import neuronperf import neuronperf.torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # Add to these lists or change as needed model_names = ["distilroberta-base"] sequence_lengths = [128] batch_sizes = [6] pipeline_sizes = [1] def get_batch(tokenizer, sequence_length, batch_size): sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" paraphrase = tokenizer.encode_plus( sequence_0, sequence_1, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="pt", ) inputs = ( torch.cat([paraphrase["input_ids"]] * batch_size, 0), torch.cat([paraphrase["attention_mask"]] * batch_size, 0), ) return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = 
AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) neuronperf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/hf-google-vit_benchmark.py ================================================ import torch import neuronperf import neuronperf.torch import torch_neuronx from PIL import Image import requests from transformers import ViTImageProcessor, ViTForImageClassification def benchmark(batch_size): feature_extractor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224') model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224', torchscript=True) model.eval() url = "http://images.cocodataset.org/val2017/000000039769.jpg" image = Image.open(requests.get(url, stream=True).raw) inputs = feature_extractor(images=image, return_tensors="pt") inputs = inputs['pixel_values'].repeat([batch_size, 1, 1, 1]) example = (inputs,) traced = torch_neuronx.trace(model, example, compiler_args="--model-type=transformer") filename = 'model.pt' torch.jit.save(traced, filename) reports = neuronperf.torch.benchmark(filename, [example], batch_sizes=[batch_size]) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) if __name__ == '__main__': # Use batch_size = 1 for best latency, batch_size = 2 for best throughput benchmark(batch_size=2) ================================================ FILE: archive/src/benchmark/pytorch/hf-openai-clip_benchmark.py ================================================ import torch import neuronperf import neuronperf.torch import torch_neuronx import os from torchvision.datasets import CIFAR100 from transformers import CLIPProcessor, CLIPModel def benchmark(model_name, batch_size): # Build the model, preprocessor, and dataset cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False) processor = CLIPProcessor.from_pretrained(model_name) model = CLIPModel.from_pretrained(model_name, return_dict=False) # Prepare a sample input image = cifar100[0][0] text = [] for c in cifar100.classes: text.append(f'a photo of a {c}') inputs = processor(text=text, images=image, return_tensors="pt", padding=True) image = inputs['pixel_values'] # (b, c, h, w) image = image.repeat(batch_size, 1, 1, 1) inputs = (inputs['input_ids'], image) # Trace the model model.eval() traced = torch_neuronx.trace(model, inputs, compiler_args='--enable-saturate-infinity') filename = 'model.pt' torch.jit.save(traced, filename) reports = neuronperf.torch.benchmark(filename, [inputs], batch_sizes=[batch_size]) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) if __name__ == '__main__': # Recommended batch sizes for throughput # openai/clip-vit-base-patch32: 64 # openai/clip-vit-large-patch14: 4 model_name = 'openai/clip-vit-base-patch32' batch_size = 64 benchmark(model_name, batch_size) ================================================ FILE: archive/src/benchmark/pytorch/hf_pretrained_wav2vec2_conformer_relpos_benchmark.py 
================================================ import torch import torch_neuronx from datasets import load_dataset from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC import neuronperf as npf import neuronperf.torch BATCH_SIZE = 1 def benchmark(): processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft") model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft") model.eval() # take the first entry in the dataset as our input ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True) inputs = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest", sampling_rate=16_000).input_values inputs = inputs.repeat([BATCH_SIZE, 1]) example = (inputs,) traced = torch_neuronx.trace(model, example, compiler_args='--model-type=transformer') filename = 'model.pt' torch.jit.save(traced, filename) model_neuron = torch.jit.load(filename) output = model_neuron(inputs) print(f"output is {output}") reports = neuronperf.torch.benchmark(filename, [example], multiprocess=False, batch_sizes=[BATCH_SIZE]) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) if __name__ == '__main__': benchmark() ================================================ FILE: archive/src/benchmark/pytorch/hf_pretrained_wav2vec2_conformer_rope_benchmark.py ================================================ import torch import torch_neuronx from datasets import load_dataset from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC import neuronperf as npf import neuronperf.torch BATCH_SIZE = 1 def benchmark(): processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft") model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft") model.eval() # take the first entry in the dataset as our input ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True) inputs = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest", sampling_rate=16_000).input_values inputs = inputs.repeat([BATCH_SIZE, 1]) example = (inputs,) traced = torch_neuronx.trace(model, example, compiler_args='--model-type=transformer') filename = 'model.pt' torch.jit.save(traced, filename) model_neuron = torch.jit.load(filename) output = model_neuron(inputs) print(f"output is {output}") reports = neuronperf.torch.benchmark(filename, [example], multiprocess=False, batch_sizes=[BATCH_SIZE]) # View and save results print("======== {} ========".format(filename)) neuronperf.print_reports(reports) neuronperf.write_csv(reports) neuronperf.write_json(reports) if __name__ == '__main__': benchmark() ================================================ FILE: archive/src/benchmark/pytorch/inf2_benchmark.py ================================================ # primary Script used for inf2 Benchmarking import torch import neuronperf import neuronperf.torch import torch_neuronx from transformers import ( AutoModel, AutoModelForSequenceClassification # Any other model class respective to the model we want to infer on ) class GPT2Neuron(torch.nn.Module): def __init__(self, model) -> None: super().__init__() self.model = model def forward(self, input_ids, attention_mask): return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False) def 
benchmark(model_name, batch_size, sequence_length):
    model = AutoModel.from_pretrained(model_name, torchscript=True)
    if 'gpt2' in model_name:
        model = GPT2Neuron(model)
    model.eval()

    example = (
        torch.zeros(batch_size, sequence_length, dtype=torch.int),  # input_ids
        torch.zeros(batch_size, sequence_length, dtype=torch.int),  # attention_mask
    )
    traced = torch_neuronx.trace(model, example)
    filename = 'model.pt'
    torch.jit.save(traced, filename)

    reports = neuronperf.torch.benchmark(filename, [example])

    # View and save results
    print("======== {} ========".format(filename))
    neuronperf.print_reports(reports)
    neuronperf.write_csv(reports)
    neuronperf.write_json(reports)


if __name__ == '__main__':
    # benchmark(model_name, batch_size, sequence_length)
    # Below are a few examples; uncomment one to run:
    # benchmark('bert-base-cased', 16, 128)
    # benchmark('bert-base-uncased', 4, 128)
    # benchmark('gpt2', 16, 256)
    pass  # required: an indented statement must follow the `if`; the calls above are examples

================================================
FILE: archive/src/benchmark/pytorch/opt_benchmark.py
================================================

import os

import neuronperf as npf
import torch
from transformers import AutoTokenizer

"""
Run the sample at this link to get the split model state_dict (opt-13b-split):
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/facebook-opt-13b-sampling.ipynb

Make sure transformers is installed

Change the variables below for opt30b or opt66b models
"""

BATCH_SIZE = 2
TP_DEGREE = 2
SEQ_LEN = 2048
TOKENIZER = AutoTokenizer.from_pretrained("facebook/opt-13b")
MODEL_DIR = "./opt-13b-split"


class Wrapper(torch.nn.Module):
    def __init__(self, filename):
        super().__init__()
        from transformers_neuronx.opt.model import OPTForSampling

        self.neuron_model = OPTForSampling.from_pretrained(
            filename, batch_size=BATCH_SIZE, tp_degree=TP_DEGREE, amp="f16"
        )
        self.neuron_model.to_neuron()

    def forward(self, *inputs):
        return self.neuron_model.sample(torch.concat(inputs), sequence_length=SEQ_LEN)


# Custom load to let our Wrapper class handle things
def load_fn(filename, **kwargs):
    return Wrapper(filename)


# NeuronPerf can't see tp_degree at the moment, so just expose all cores
def env_setup_fn(*_):
    del os.environ["NEURON_RT_VISIBLE_CORES"]


def preprocess_fn(inputs):
    return [TOKENIZER.encode(text, return_tensors="pt") for text in inputs]


def postprocess_fn(outputs):
    return [TOKENIZER.decode(seq) for seq in outputs]


def benchmark():
    inputs = ["Hello, I'm a language model,"] * BATCH_SIZE
    reports = npf.benchmark(
        load_fn,
        MODEL_DIR,
        [inputs],  # treat batch as 1 input and let Wrapper handle batching
        batch_sizes=1,  # ^
        n_models=1,  # only load 1 copy of model
        max_infers=5,
        max_duration=0,  # sampling can take a while, so let's not timeout
        workers_per_model=1,  # no bottleneck on model inputs, so 1 is fine
        env_setup_fn=env_setup_fn,
        preprocess_fn=preprocess_fn,
        postprocess_fn=postprocess_fn,
    )

    # grab the only report (we only benchmarked 1 config)
    report = reports[0]

    # let's update throughput to be tokens / second and add a new record
    new_tokens = sum(SEQ_LEN - len(TOKENIZER.encode(i)) for i in inputs)
    tokens_per_s = round(new_tokens / (report["latency_ms_avg"] / 1000), 2)
    report["throughput_avg"] = report["tokens_per_s"] = tokens_per_s

    # display and save results
    npf.print_report(report)
    print(f"Results saved to: {npf.write_json(report)}")


if __name__ == "__main__":
    benchmark()

================================================
FILE: archive/src/benchmark/pytorch/perceiver-multimodal_benchmark.py
================================================

import base64
import
os import ssl import re from urllib import request import time import random from tqdm import tqdm import numpy as np import math from typing import Optional, Tuple, Union from transformers import PerceiverForMultimodalAutoencoding from transformers.modeling_outputs import BaseModelOutputWithCrossAttentions from transformers.models.perceiver.modeling_perceiver import PerceiverBasicDecoder, PerceiverClassifierOutput from transformers.models.perceiver.modeling_perceiver import restructure import torch import torch.nn as nn import torch_neuronx # We cannot use any of the pre-existing benchmarking utilities to benchmark E2E pipeline models. # All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a # traced Torchscript. def benchmark(n_runs, test_name, model, model_inputs): if not isinstance(model_inputs, tuple): model_inputs = (model_inputs,) warmup_run = model(*model_inputs) latency_collector = LatencyCollector() for _ in range(n_runs): latency_collector.pre_hook() res = model(*model_inputs) latency_collector.hook() p0_latency_ms = latency_collector.percentile(0) * 1000 p50_latency_ms = latency_collector.percentile(50) * 1000 p90_latency_ms = latency_collector.percentile(90) * 1000 p95_latency_ms = latency_collector.percentile(95) * 1000 p99_latency_ms = latency_collector.percentile(99) * 1000 p100_latency_ms = latency_collector.percentile(100) * 1000 report_dict = dict() report_dict["Latency P0"] = f'{p0_latency_ms:.1f}' report_dict["Latency P50"]=f'{p50_latency_ms:.1f}' report_dict["Latency P90"]=f'{p90_latency_ms:.1f}' report_dict["Latency P95"]=f'{p95_latency_ms:.1f}' report_dict["Latency P99"]=f'{p99_latency_ms:.1f}' report_dict["Latency P100"]=f'{p100_latency_ms:.1f}' report = f'RESULT FOR {test_name}:' for key, value in report_dict.items(): report += f' {key}={value}' print(report) class LatencyCollector: def __init__(self): self.start = None self.latency_list = [] def pre_hook(self, *args): self.start = time.time() def hook(self, *args): self.latency_list.append(time.time() - self.start) def percentile(self, percent): latency_list = self.latency_list pos_float = len(latency_list) * percent / 100 max_pos = len(latency_list) - 1 pos_floor = min(math.floor(pos_float), max_pos) pos_ceil = min(math.ceil(pos_float), max_pos) latency_list = sorted(latency_list) return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] class MultimodalPerceiverWrapper(nn.Module): def __init__(self, perceiver_model, nchunks, image_chunk_size, audio_chunk_size): super().__init__() self.perceiver_model = perceiver_model self.nchunks = nchunks self.image_chunk_size = image_chunk_size self.audio_chunk_size = audio_chunk_size def forward(self, inputs: torch.FloatTensor, neuron_decoder, attention_mask: Optional[torch.FloatTensor] = None, head_mask: Optional[torch.FloatTensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None): output_attentions = output_attentions if output_attentions is not None else self.perceiver_model.config.output_attentions output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.perceiver_model.config.output_hidden_states ) return_dict = return_dict if return_dict is not None else self.perceiver_model.config.use_return_dict if self.perceiver_model.input_preprocessor is not None: inputs, modality_sizes, inputs_without_pos = self.perceiver_model.input_preprocessor(inputs) else: modality_sizes = None 
            inputs_without_pos = None

        if inputs.size()[-1] != self.perceiver_model.config.d_model:
            raise ValueError(
                f"Last dimension of the inputs: {inputs.size()[-1]} doesn't correspond to config.d_model:"
                f" {self.perceiver_model.config.d_model}. Make sure to set config.d_model appropriately."
            )

        batch_size, seq_length, _ = inputs.size()
        device = inputs.device

        # If no attention mask is provided, make them all ones
        if attention_mask is None:
            attention_mask = torch.ones((batch_size, seq_length), device=device)
        # Make the attention mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
        extended_attention_mask = self.perceiver_model.invert_attention_mask(attention_mask)

        head_mask = self.perceiver_model.get_head_mask(
            head_mask,
            self.perceiver_model.config.num_blocks * self.perceiver_model.config.num_self_attends_per_block)

        embedding_output = self.perceiver_model.embeddings(batch_size=batch_size)

        encoder_outputs = self.perceiver_model.encoder(
            embedding_output,
            attention_mask=None,
            head_mask=head_mask,
            inputs=inputs,
            inputs_mask=extended_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]

        logits = None
        reconstruction = {}
        for chunk_idx in tqdm(range(self.nchunks)):
            subsampled_output_points = {
                'image': torch.arange(
                    self.image_chunk_size * chunk_idx,
                    self.image_chunk_size * (chunk_idx + 1)).to(device),
                'audio': torch.arange(
                    self.audio_chunk_size * chunk_idx,
                    self.audio_chunk_size * (chunk_idx + 1)).to(device),
                'label': None,
            }
            logits = neuron_decoder(sequence_output, extended_attention_mask, inputs, modality_sizes,
                                    inputs_without_pos, subsampled_points=subsampled_output_points)
            reconstruction['label'] = logits['label']
            if 'image' not in reconstruction:
                reconstruction['image'] = logits['image']
                reconstruction['audio'] = logits['audio']
            else:
                reconstruction['image'] = torch.cat([reconstruction['image'], logits['image']], dim=1)
                reconstruction['audio'] = torch.cat([reconstruction['audio'], logits['audio']], dim=1)
            del logits
        return reconstruction


def custom_model_forward(
        self,
        nchunks,
        image_chunk_size,
        audio_chunk_size,
        neuron_decoder,
        inputs: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
) -> Union[Tuple, PerceiverClassifierOutput]:
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    perceiver_wrapper = MultimodalPerceiverWrapper(self.perceiver, nchunks, image_chunk_size, audio_chunk_size)
    outputs = perceiver_wrapper(
        inputs,
        neuron_decoder,
        attention_mask=attention_mask,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    return outputs


def custom_decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
    if self.position_encoding_type == "none":  # Queries come from elsewhere
        raise ValueError("You cannot construct decoder queries when position_encoding_type is set to none")
    if subsampled_points is not None:
        # subsampled_points are the indices if the inputs would be flattened
        # however, the inputs aren't flattened, that's why we use unravel_index
        # to get the indices for the unflattened array
        # unravel_index returns a tuple (x_idx, y_idx, ...)
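        # (for example, flat index 5 in shape (2, 3) unravels to (1, 2))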
        # stack to get the [n, d] tensor of coordinates
        def unravel_indices(indices, shape):
            coord = []
            for dim in reversed(shape):
                coord.append(indices % dim)
                indices = indices // dim
            coord = torch.stack(coord[::-1], dim=-1)
            return coord

        pos = unravel_indices(subsampled_points, self.output_index_dims)
        batch_size = inputs.shape[0]
        # Map these coordinates to [-1, 1]
        pos = -1 + 2 * pos / torch.tensor(self.output_index_dims)[None, :]
        pos = torch.broadcast_to(pos[None], [batch_size, pos.shape[0], pos.shape[1]])
        # Construct the position encoding.
        if self.position_encoding_type == "trainable":
            pos_emb = self.output_position_encodings(batch_size)
        elif self.position_encoding_type == "fourier":
            pos_emb = self.output_position_encodings(
                self.output_index_dims, batch_size=batch_size, device=inputs.device, dtype=inputs.dtype, pos=pos
            )

        # Optionally project them to a target dimension.
        pos_emb = self.positions_projection(pos_emb)
        pos_emb = torch.reshape(pos_emb, [pos_emb.shape[0], -1, pos_emb.shape[-1]])
    else:
        batch_size = inputs.shape[0]
        index_dims = inputs.shape[2:]
        # Construct the position encoding.
        if self.position_encoding_type == "trainable":
            pos_emb = self.output_position_encodings(batch_size)
        elif self.position_encoding_type == "fourier":
            pos_emb = self.output_position_encodings(
                index_dims, batch_size, device=inputs.device, dtype=inputs.dtype
            )

        # Optionally project them to a target dimension.
        pos_emb = self.positions_projection(pos_emb)

    if self.concat_preprocessed_input:
        if inputs_without_pos is None:
            raise ValueError("Value is required for inputs_without_pos if concat_preprocessed_input is True")
        pos_emb = torch.cat([inputs_without_pos, pos_emb], dim=-1)

    return pos_emb


# Define wrapper for tracing encoder
class EncoderWrapper(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, embedding_output, inputs, extended_attention_mask):
        output = self.encoder(embedding_output, inputs=inputs, inputs_mask=extended_attention_mask)
        return output


class NeuronEncoder(nn.Module):
    def __init__(self, encoder_wrapper):
        super().__init__()
        self.encoder_wrapper = encoder_wrapper

    def forward(self,
                hidden_states: torch.Tensor,
                attention_mask: Optional[torch.FloatTensor] = None,
                head_mask: Optional[torch.FloatTensor] = None,
                inputs: Optional[torch.FloatTensor] = None,
                inputs_mask: Optional[torch.FloatTensor] = None,
                output_attentions: Optional[bool] = False,
                output_hidden_states: Optional[bool] = False,
                return_dict: Optional[bool] = True):
        last_hidden_states = self.encoder_wrapper(hidden_states, inputs, inputs_mask)['last_hidden_state']
        return BaseModelOutputWithCrossAttentions(last_hidden_state=last_hidden_states)


# Define wrapper for tracing decoder
class DecoderWrapper(nn.Module):
    def __init__(self, decoder, decoder_query_audio, decoder_query_image, decoder_query_label, output_postprocessor):
        super().__init__()
        self.decoder = decoder
        self.decoder_query_audio = decoder_query_audio
        self.decoder_query_image = decoder_query_image
        self.decoder_query_label = decoder_query_label
        self.output_postprocessor = output_postprocessor
        self.num_query_channels = decoder.num_query_channels

    def forward(self, z, query_mask,
                audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding,
                image_input, image_input_without_pos, image_subsampled_point, image_padding,
                label_input, label_input_without_pos, label_padding):
        audio_query = self.decoder_query_audio(inputs=audio_input,
                                               inputs_without_pos=audio_input_without_pos,
                                               subsampled_points=audio_subsampled_point)
        image_query = self.decoder_query_image(inputs=image_input,
                                               inputs_without_pos=image_input_without_pos,
                                               subsampled_points=image_subsampled_point)
        label_query = self.decoder_query_label(inputs=label_input, inputs_without_pos=label_input_without_pos)

        def embed(x, pos):
            x = torch.reshape(x, [x.shape[0], np.prod(x.shape[1:-1]), x.shape[-1]])
            pos = torch.broadcast_to(pos, [x.shape[0], x.shape[1], self.num_query_channels - x.shape[2]])
            return torch.cat([x, pos], dim=2)

        audio_padded = embed(audio_query, audio_padding)
        image_padded = embed(image_query, image_padding)
        label_padded = embed(label_query, label_padding)

        decoder_query = torch.cat([audio_padded, image_padded, label_padded], dim=1)
        logits = self.decoder(decoder_query, z, query_mask).logits

        output_modality_sizes = {"audio": audio_subsampled_point.shape[0],
                                 "image": image_subsampled_point.shape[0],
                                 "label": 1}
        logits = self.output_postprocessor(logits, modality_sizes=output_modality_sizes)
        return logits


class NeuronDecoder(nn.Module):
    def __init__(self, decoder_wrapper):
        super().__init__()
        self.decoder_wrapper = decoder_wrapper
        self.modalities = decoder_wrapper.decoder.modalities
        self.padding = decoder_wrapper.decoder.padding

    def forward(self, z, query_mask, inputs, modality_sizes, inputs_without_pos=None,
                subsampled_points=None, output_attentions=False):
        # Partition the flat inputs among the different modalities
        inputs = restructure(modality_sizes, inputs)
        assert(subsampled_points is not None)
        assert(inputs_without_pos is not None)

        for modality, decoder in self.modalities.items():
            if modality == "audio":
                audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding = (
                    inputs[modality], inputs_without_pos[modality],
                    subsampled_points[modality].to(torch.float32), self.padding[modality])
            elif modality == "image":
                image_input, image_input_without_pos, image_subsampled_point, image_padding = (
                    inputs[modality], inputs_without_pos[modality],
                    subsampled_points[modality].to(torch.float32), self.padding[modality])
            else:
                # label doesn't have subsampled point
                label_input, label_input_without_pos, label_padding = (
                    inputs[modality], inputs_without_pos[modality], self.padding[modality])

        assert(audio_input_without_pos is not None)
        assert(audio_subsampled_point is not None)
        assert(image_input_without_pos is not None)
        assert(image_subsampled_point is not None)
        assert(label_input_without_pos is not None)

        output = self.decoder_wrapper(z, query_mask,
                                      audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding,
                                      image_input, image_input_without_pos, image_subsampled_point, image_padding,
                                      label_input, label_input_without_pos, label_padding)
        return output


# -- Load compiled models --
model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver", low_cpu_mem_usage=True)
PerceiverForMultimodalAutoencoding.forward = custom_model_forward
PerceiverBasicDecoder.decoder_query = custom_decoder_query

COMPILER_WORKDIR_ROOT = "perceiver_multimodal_compile_dir"
COMPILER_WORKDIR_DECODER = os.path.join(COMPILER_WORKDIR_ROOT, "decoder")
COMPILER_WORKDIR_ENCODER = os.path.join(COMPILER_WORKDIR_ROOT, "encoder")

# load saved encoder from disk
encoder_fname = os.path.join(COMPILER_WORKDIR_ENCODER, 'model.pt')
neuron_encoder = NeuronEncoder(EncoderWrapper(model.perceiver.encoder))
neuron_encoder.encoder_wrapper = torch.jit.load(encoder_fname)
model.perceiver.encoder = neuron_encoder

# load saved decoder from disk
decoder_fname = os.path.join(COMPILER_WORKDIR_DECODER, 'model.pt')
neuron_decoder = NeuronDecoder(DecoderWrapper(model.perceiver.decoder,
                                              model.perceiver.decoder.modalities['audio'].decoder_query,
                                              model.perceiver.decoder.modalities['image'].decoder_query,
                                              model.perceiver.decoder.modalities['label'].decoder_query,
                                              model.perceiver.output_postprocessor))
neuron_decoder.decoder_wrapper = torch.jit.load(decoder_fname)


# Inference function
def autoencode_video(images, audio, nchunks, image_chunk_size, audio_chunk_size):
    input_image = torch.from_numpy(np.moveaxis(images, -1, 2)).to(torch.float32)
    input_audio = torch.from_numpy(audio).to(torch.float32)
    input_label = torch.zeros((images.shape[0], 700))
    inputs = {'image': input_image, 'audio': input_audio, 'label': input_label}

    reconstruction = {}
    with torch.no_grad():
        reconstruction = model(nchunks, image_chunk_size, audio_chunk_size, neuron_decoder, inputs=inputs)
        # reshape image and audio modalities back to original shape
        reconstruction['image'] = torch.reshape(reconstruction['image'], images.shape)
        reconstruction['audio'] = torch.reshape(reconstruction['audio'], audio.shape)
    return reconstruction


# Generate random image for benchmarking
AUDIO_SAMPLES_PER_PATCH = 16
image = np.random.random(size=(1, 16, 224, 224, 3))
audio = np.random.random(size=(1, 30720, 1))
nchunks = 128
image_chunk_size = np.prod(image.shape[1:-1]) // nchunks
audio_chunk_size = audio.shape[1] // AUDIO_SAMPLES_PER_PATCH // nchunks

n_runs = 20
model_inputs = (image, audio, nchunks, image_chunk_size, audio_chunk_size)
benchmark(n_runs, "perceiver-multimodal", autoencode_video, model_inputs)


================================================
FILE: archive/src/benchmark/pytorch/perceiver-multimodal_compile.py
================================================
import base64
import os
import ssl
import re
from urllib import request
import time
import random
from tqdm import tqdm
import numpy as np
from typing import Optional, Tuple, Union

from transformers import PerceiverForMultimodalAutoencoding
from transformers.modeling_outputs import BaseModelOutputWithCrossAttentions
from transformers.models.perceiver.modeling_perceiver import PerceiverBasicDecoder, PerceiverClassifierOutput
from transformers.models.perceiver.modeling_perceiver import restructure

import torch
import torch.nn as nn
import torch_neuronx


class MultimodalPerceiverWrapper(nn.Module):
    def __init__(self, perceiver_model, nchunks, image_chunk_size, audio_chunk_size):
        super().__init__()
        self.perceiver_model = perceiver_model
        self.nchunks = nchunks
        self.image_chunk_size = image_chunk_size
        self.audio_chunk_size = audio_chunk_size

    def forward(self,
                inputs: torch.FloatTensor,
                neuron_decoder,
                attention_mask: Optional[torch.FloatTensor] = None,
                head_mask: Optional[torch.FloatTensor] = None,
                output_attentions: Optional[bool] = None,
                output_hidden_states: Optional[bool] = None,
                return_dict: Optional[bool] = None):
        output_attentions = output_attentions if output_attentions is not None else self.perceiver_model.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.perceiver_model.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.perceiver_model.config.use_return_dict

        if self.perceiver_model.input_preprocessor is not None:
            inputs, modality_sizes, inputs_without_pos = self.perceiver_model.input_preprocessor(inputs)
        else:
            modality_sizes = None
            inputs_without_pos = None

        if inputs.size()[-1] != self.perceiver_model.config.d_model:
            raise ValueError(
                f"Last dimension of the inputs: {inputs.size()[-1]} doesn't correspond to config.d_model:"
                f" {self.perceiver_model.config.d_model}. Make sure to set config.d_model appropriately."
            )

        batch_size, seq_length, _ = inputs.size()
        device = inputs.device

        # If no attention mask is provided, make them all ones
        if attention_mask is None:
            attention_mask = torch.ones((batch_size, seq_length), device=device)
        # Make the attention mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
        extended_attention_mask = self.perceiver_model.invert_attention_mask(attention_mask)

        head_mask = self.perceiver_model.get_head_mask(
            head_mask,
            self.perceiver_model.config.num_blocks * self.perceiver_model.config.num_self_attends_per_block)

        embedding_output = self.perceiver_model.embeddings(batch_size=batch_size)

        encoder_outputs = self.perceiver_model.encoder(
            embedding_output,
            attention_mask=None,
            head_mask=head_mask,
            inputs=inputs,
            inputs_mask=extended_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]

        logits = None
        reconstruction = {}
        for chunk_idx in tqdm(range(self.nchunks)):
            subsampled_output_points = {
                'image': torch.arange(
                    self.image_chunk_size * chunk_idx,
                    self.image_chunk_size * (chunk_idx + 1)).to(device),
                'audio': torch.arange(
                    self.audio_chunk_size * chunk_idx,
                    self.audio_chunk_size * (chunk_idx + 1)).to(device),
                'label': None,
            }
            logits = neuron_decoder(sequence_output, extended_attention_mask, inputs, modality_sizes,
                                    inputs_without_pos, subsampled_points=subsampled_output_points)
            reconstruction['label'] = logits['label']
            if 'image' not in reconstruction:
                reconstruction['image'] = logits['image']
                reconstruction['audio'] = logits['audio']
            else:
                reconstruction['image'] = torch.cat([reconstruction['image'], logits['image']], dim=1)
                reconstruction['audio'] = torch.cat([reconstruction['audio'], logits['audio']], dim=1)
            del logits
        return reconstruction


def custom_model_forward(
        self,
        nchunks,
        image_chunk_size,
        audio_chunk_size,
        neuron_decoder,
        inputs: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
) -> Union[Tuple, PerceiverClassifierOutput]:
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    perceiver_wrapper = MultimodalPerceiverWrapper(self.perceiver, nchunks, image_chunk_size, audio_chunk_size)
    outputs = perceiver_wrapper(
        inputs,
        neuron_decoder,
        attention_mask=attention_mask,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    return outputs


def custom_decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
    if self.position_encoding_type == "none":  # Queries come from elsewhere
        raise ValueError("You cannot construct decoder queries when position_encoding_type is set to none")
    if subsampled_points is not None:
        # subsampled_points are the indices if the inputs would be flattened
        # however, the inputs aren't flattened, that's why we use unravel_index
        # to get the indices for the unflattened array
        # unravel_index returns a tuple (x_idx, y_idx, ...)
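        # (for example, flat index 5 in shape (2, 3) unravels to (1, 2))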
        # stack to get the [n, d] tensor of coordinates
        def unravel_indices(indices, shape):
            coord = []
            for dim in reversed(shape):
                coord.append(indices % dim)
                indices = indices // dim
            coord = torch.stack(coord[::-1], dim=-1)
            return coord

        pos = unravel_indices(subsampled_points, self.output_index_dims)
        batch_size = inputs.shape[0]
        # Map these coordinates to [-1, 1]
        pos = -1 + 2 * pos / torch.tensor(self.output_index_dims)[None, :]
        pos = torch.broadcast_to(pos[None], [batch_size, pos.shape[0], pos.shape[1]])
        # Construct the position encoding.
        if self.position_encoding_type == "trainable":
            pos_emb = self.output_position_encodings(batch_size)
        elif self.position_encoding_type == "fourier":
            pos_emb = self.output_position_encodings(
                self.output_index_dims, batch_size=batch_size, device=inputs.device, dtype=inputs.dtype, pos=pos
            )

        # Optionally project them to a target dimension.
        pos_emb = self.positions_projection(pos_emb)
        pos_emb = torch.reshape(pos_emb, [pos_emb.shape[0], -1, pos_emb.shape[-1]])
    else:
        batch_size = inputs.shape[0]
        index_dims = inputs.shape[2:]
        # Construct the position encoding.
        if self.position_encoding_type == "trainable":
            pos_emb = self.output_position_encodings(batch_size)
        elif self.position_encoding_type == "fourier":
            pos_emb = self.output_position_encodings(
                index_dims, batch_size, device=inputs.device, dtype=inputs.dtype
            )

        # Optionally project them to a target dimension.
        pos_emb = self.positions_projection(pos_emb)

    if self.concat_preprocessed_input:
        if inputs_without_pos is None:
            raise ValueError("Value is required for inputs_without_pos if concat_preprocessed_input is True")
        pos_emb = torch.cat([inputs_without_pos, pos_emb], dim=-1)

    return pos_emb


# Define wrapper for tracing encoder
class EncoderWrapper(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, embedding_output, inputs, extended_attention_mask):
        output = self.encoder(embedding_output, inputs=inputs, inputs_mask=extended_attention_mask)
        return output


class NeuronEncoder(nn.Module):
    def __init__(self, encoder_wrapper):
        super().__init__()
        self.encoder_wrapper = encoder_wrapper

    def forward(self,
                hidden_states: torch.Tensor,
                attention_mask: Optional[torch.FloatTensor] = None,
                head_mask: Optional[torch.FloatTensor] = None,
                inputs: Optional[torch.FloatTensor] = None,
                inputs_mask: Optional[torch.FloatTensor] = None,
                output_attentions: Optional[bool] = False,
                output_hidden_states: Optional[bool] = False,
                return_dict: Optional[bool] = True):
        last_hidden_states = self.encoder_wrapper(hidden_states, inputs, inputs_mask)['last_hidden_state']
        return BaseModelOutputWithCrossAttentions(last_hidden_state=last_hidden_states)


# Define wrapper for tracing decoder
class DecoderWrapper(nn.Module):
    def __init__(self, decoder, decoder_query_audio, decoder_query_image, decoder_query_label, output_postprocessor):
        super().__init__()
        self.decoder = decoder
        self.decoder_query_audio = decoder_query_audio
        self.decoder_query_image = decoder_query_image
        self.decoder_query_label = decoder_query_label
        self.output_postprocessor = output_postprocessor
        self.num_query_channels = decoder.num_query_channels

    def forward(self, z, query_mask,
                audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding,
                image_input, image_input_without_pos, image_subsampled_point, image_padding,
                label_input, label_input_without_pos, label_padding):
        audio_query = self.decoder_query_audio(inputs=audio_input,
                                               inputs_without_pos=audio_input_without_pos,
                                               subsampled_points=audio_subsampled_point)
        image_query = self.decoder_query_image(inputs=image_input,
                                               inputs_without_pos=image_input_without_pos,
                                               subsampled_points=image_subsampled_point)
        label_query = self.decoder_query_label(inputs=label_input, inputs_without_pos=label_input_without_pos)

        def embed(x, pos):
            x = torch.reshape(x, [x.shape[0], np.prod(x.shape[1:-1]), x.shape[-1]])
            pos = torch.broadcast_to(pos, [x.shape[0], x.shape[1], self.num_query_channels - x.shape[2]])
            return torch.cat([x, pos], dim=2)

        audio_padded = embed(audio_query, audio_padding)
        image_padded = embed(image_query, image_padding)
        label_padded = embed(label_query, label_padding)

        decoder_query = torch.cat([audio_padded, image_padded, label_padded], dim=1)
        logits = self.decoder(decoder_query, z, query_mask).logits

        output_modality_sizes = {"audio": audio_subsampled_point.shape[0],
                                 "image": image_subsampled_point.shape[0],
                                 "label": 1}
        logits = self.output_postprocessor(logits, modality_sizes=output_modality_sizes)
        return logits


class NeuronDecoder(nn.Module):
    def __init__(self, decoder_wrapper):
        super().__init__()
        self.decoder_wrapper = decoder_wrapper
        self.modalities = decoder_wrapper.decoder.modalities
        self.padding = decoder_wrapper.decoder.padding

    def forward(self, z, query_mask, inputs, modality_sizes, inputs_without_pos=None,
                subsampled_points=None, output_attentions=False):
        # Partition the flat inputs among the different modalities
        inputs = restructure(modality_sizes, inputs)
        assert(subsampled_points is not None)
        assert(inputs_without_pos is not None)

        for modality, decoder in self.modalities.items():
            if modality == "audio":
                audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding = (
                    inputs[modality], inputs_without_pos[modality],
                    subsampled_points[modality].to(torch.float32), self.padding[modality])
            elif modality == "image":
                image_input, image_input_without_pos, image_subsampled_point, image_padding = (
                    inputs[modality], inputs_without_pos[modality],
                    subsampled_points[modality].to(torch.float32), self.padding[modality])
            else:
                # label doesn't have subsampled point
                label_input, label_input_without_pos, label_padding = (
                    inputs[modality], inputs_without_pos[modality], self.padding[modality])

        assert(audio_input_without_pos is not None)
        assert(audio_subsampled_point is not None)
        assert(image_input_without_pos is not None)
        assert(image_subsampled_point is not None)
        assert(label_input_without_pos is not None)

        output = self.decoder_wrapper(z, query_mask,
                                      audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding,
                                      image_input, image_input_without_pos, image_subsampled_point, image_padding,
                                      label_input, label_input_without_pos, label_padding)
        return output


model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver", low_cpu_mem_usage=True)

COMPILER_WORKDIR_ROOT = "perceiver_multimodal_compile_dir"

PerceiverForMultimodalAutoencoding.forward = custom_model_forward
PerceiverBasicDecoder.decoder_query = custom_decoder_query

# --- Compile Encoder ---
# Define sample inputs for tracing encoder
embedding_output = torch.randn(1, 784, 512)
sample_inputs = torch.randn(1, 52097, 704)
extended_attention_mask = torch.zeros(1, 1, 1, 52097)

# Wrap and trace the encoder, save the traced encoder
COMPILER_WORKDIR_ENCODER = os.path.join(COMPILER_WORKDIR_ROOT, "encoder")
neuron_encoder = NeuronEncoder(EncoderWrapper(model.perceiver.encoder))

# You might see a warning from trace about unused input - these are safe to ignore.
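# A sanity check one could run once tracing completes (a sketch only, assuming
# the traced wrapper keeps the dict-style output that NeuronEncoder indexes
# below; not part of the original flow):
#
#   cpu_ref = EncoderWrapper(model.perceiver.encoder)(
#       embedding_output, sample_inputs, extended_attention_mask)['last_hidden_state']
#   neuron_out = neuron_encoder.encoder_wrapper(
#       embedding_output, sample_inputs, extended_attention_mask)['last_hidden_state']
#   print("max abs diff:", torch.max(torch.abs(cpu_ref - neuron_out)))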
print("Compiling Encoder...") neuron_encoder.encoder_wrapper = torch_neuronx.trace( neuron_encoder.encoder_wrapper, (embedding_output, sample_inputs, extended_attention_mask), compiler_workdir=COMPILER_WORKDIR_ENCODER, compiler_args=[f"--temp-dir={COMPILER_WORKDIR_ENCODER}", "--auto-cast=none"] # --auto-cast=none is needed to avoid numerical error. ) # Save compiled encoder encoder_fname = os.path.join(COMPILER_WORKDIR_ENCODER, 'model.pt') torch.jit.save(neuron_encoder.encoder_wrapper, encoder_fname) # --- Compile Decoder --- # Define sample inputs for tracing decoder z = torch.randn(1, 784, 512) query_mask = torch.zeros(1, 1, 1, 52097) audio_input = torch.randn(1, 1920, 704) audio_input_without_pos = torch.randn(1, 1920, 16) audio_subsampled_point = torch.arange(0, 15, dtype=torch.float32) # 15 = 1920/128 audio_padding = torch.randn(1, 641) image_input = torch.randn(1, 50176, 704) image_input_without_pos = torch.randn(1, 50176, 48) image_subsampled_point = torch.arange(0, 6272, dtype=torch.float32) # 6272 = 224*224*16/128 image_padding = torch.randn(1, 831) label_input = torch.randn(1, 1, 704) label_input_without_pos = torch.randn(1, 1, 700) label_padding = torch.randn(1, 2) # Wrap and trace the decoder, save the traced decoder COMPILER_WORKDIR_DECODER = os.path.join(COMPILER_WORKDIR_ROOT, "decoder") neuron_decoder = NeuronDecoder(DecoderWrapper(model.perceiver.decoder, model.perceiver.decoder.modalities['audio'].decoder_query, \ model.perceiver.decoder.modalities['image'].decoder_query, model.perceiver.decoder.modalities['label'].decoder_query, \ model.perceiver.output_postprocessor)) # You might see a warning from trace about unused input - these are safe to ignore. print("Compiling decoder...") neuron_decoder.decoder_wrapper = torch_neuronx.trace( neuron_decoder.decoder_wrapper, (z, query_mask, audio_input, audio_input_without_pos, audio_subsampled_point, audio_padding, image_input, image_input_without_pos, image_subsampled_point, image_padding, label_input, label_input_without_pos, label_padding), compiler_workdir=COMPILER_WORKDIR_DECODER, compiler_args=[f"--temp-dir={COMPILER_WORKDIR_DECODER}", "--auto-cast=none"] # --auto-cast=none is needed to avoid numerical error. 
)

# Save compiled decoder
decoder_fname = os.path.join(COMPILER_WORKDIR_DECODER, 'model.pt')
torch.jit.save(neuron_decoder.decoder_wrapper, decoder_fname)

print("Done")


================================================
FILE: archive/src/benchmark/pytorch/perceiver-vision_benchmark.py
================================================
import torch

import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
models_list = [
    ("PerceiverForImageClassificationLearned", "deepmind/vision-perceiver-learned"),
    ("PerceiverForImageClassificationFourier", "deepmind/vision-perceiver-fourier"),
    ("PerceiverForImageClassificationConvProcessing", "deepmind/vision-perceiver-conv"),
]
batch_sizes = [1]
n_models = [1, 2]
workers_per_model = [1, 2]  # optimized for latency or throughput


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    for class_name, pretrained_name in models_list:
        model_name = pretrained_name.split("/")[1]
        inputs = [get_batch(batch_size) for batch_size in batch_sizes]
        filename = f"{model_name}.json"

        # Benchmark
        print("Benchmarking {}".format(filename))
        reports = npf.torch.benchmark(filename, inputs, n_models=n_models, workers_per_model=workers_per_model)

        # View and save results
        print("======== {} ========".format(filename))
        npf.print_reports(reports)
        npf.write_csv(reports)
        npf.write_json(reports)


================================================
FILE: archive/src/benchmark/pytorch/perceiver-vision_compile.py
================================================
import torch
import transformers  # ==4.32.0

import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
models_list = [
    ("PerceiverForImageClassificationLearned", "deepmind/vision-perceiver-learned"),
    ("PerceiverForImageClassificationFourier", "deepmind/vision-perceiver-fourier"),
    ("PerceiverForImageClassificationConvProcessing", "deepmind/vision-perceiver-conv"),
]
batch_sizes = [1]
pipeline_sizes = [1]


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    for class_name, pretrained_name in models_list:
        model_name = pretrained_name.split("/")[1]
        model = getattr(transformers, class_name).from_pretrained(pretrained_name)
        inputs = [get_batch(batch_size) for batch_size in batch_sizes]
        filename = f"{model_name}.json"

        # Compile
        print("Compiling {}".format(filename))
        npf.torch.compile(
            model,
            inputs,
            batch_sizes=batch_sizes,
            pipeline_sizes=pipeline_sizes,
            filename=filename,
            model_name=model_name,
        )


================================================
FILE: archive/src/benchmark/pytorch/pixart_alpha_benchmark.py
================================================
import os

os.environ["NEURON_FUSE_SOFTMAX"] = "1"
os.environ["NEURON_CUSTOM_SILU"] = "1"

import copy
import diffusers
import math
import numpy as npy
import time
import torch
import torch_neuronx
import torch.nn as nn
import torch.nn.functional as F
from diffusers import PixArtAlphaPipeline
from diffusers import Transformer2DModel
from IPython.display import clear_output
from matplotlib import image as mpimg
from matplotlib import pyplot as plt
from torch import nn
from transformers.models.t5.modeling_t5 import T5EncoderModel

# Define datatype
DTYPE = torch.bfloat16


# Specialized benchmarking class for PixArt models.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E PixArt performance,
# because the top-level PixArt pipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because PixArt pipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)


class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]


class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t

    def forward(self, text_input_ids, attention_mask=None):
        return [self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)]


class InferenceTransformerWrapper(nn.Module):
    def __init__(self, transformer: Transformer2DModel):
        super().__init__()
        self.transformer = transformer
        self.config = transformer.config
        self.dtype = transformer.dtype
        self.device = transformer.device

    def forward(self, hidden_states, encoder_hidden_states=None, timestep=None,
                encoder_attention_mask=None, added_cond_kwargs=None, return_dict=False):
        output = self.transformer(
            hidden_states, encoder_hidden_states, timestep, encoder_attention_mask)
        return output


class SimpleWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        output = self.model(x)
        return output


# --- Load all compiled models and benchmark pipeline ---
def get_pipe(resolution, dtype):
    if resolution == 256:
        transformer: Transformer2DModel = Transformer2DModel.from_pretrained(
            "PixArt-alpha/PixArt-XL-2-256x256", subfolder="transformer", torch_dtype=dtype)
        return PixArtAlphaPipeline.from_pretrained(
            "PixArt-alpha/PixArt-XL-2-512x512", transformer=transformer, torch_dtype=dtype)
    elif resolution == 512:
        return PixArtAlphaPipeline.from_pretrained(
"PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=dtype) else: raise Exception(f"Unsupport resolution {resolution} for pixart alpha") COMPILER_WORKDIR_ROOT = 'pixart_alpha_compile_dir' text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt') decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt') transformer_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'transformer/model.pt') post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt') # Select the desired resolution () resolution = 256 # resolution = 512 pipe = get_pipe(resolution, DTYPE) seqlen = 120 _neuronTextEncoder = InferenceTextEncoderWrapper(DTYPE, pipe.text_encoder, seqlen) _neuronTextEncoder.t = torch.jit.load(text_encoder_filename) pipe.text_encoder = _neuronTextEncoder assert pipe._execution_device is not None device_ids = [0, 1] _neuronTransformer = InferenceTransformerWrapper(pipe.transformer) _neuronTransformer.transformer = torch_neuronx.DataParallel(torch.jit.load(transformer_filename), device_ids, set_dynamic_batching=False) pipe.transformer = _neuronTransformer pipe.vae.decoder = SimpleWrapper(torch.jit.load(decoder_filename)) pipe.vae.post_quant_conv = SimpleWrapper(torch.jit.load(post_quant_conv_filename)) prompt = "a photo of an astronaut riding a horse on mars" n_runs = 20 benchmark(n_runs, "pixart_alpha", pipe, prompt) ================================================ FILE: archive/src/benchmark/pytorch/pixart_sigma_benchmark.py ================================================ import os os.environ["NEURON_FUSE_SOFTMAX"] = "1" os.environ["NEURON_CUSTOM_SILU"] = "1" import copy import diffusers import math import numpy as npy import time import torch import torch_neuronx import torch.nn as nn import torch.nn.functional as F from diffusers import PixArtSigmaPipeline from IPython.display import clear_output from matplotlib import image as mpimg from matplotlib import pyplot as plt from torch import nn import torch from torch import nn from transformers.models.t5.modeling_t5 import T5EncoderModel from diffusers import Transformer2DModel # Define datatype DTYPE = torch.bfloat16 # Specialized benchmarking class for PixArt models. # We cannot use any of the pre-existing benchmarking utilities to benchmark E2E PixArt performance, # because the top-level PixArt pipeline cannot be serialized into a single Torchscript object. # All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a # traced Torchscript. 
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because PixArt pipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)


class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]


class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t

    def forward(self, text_input_ids, attention_mask=None):
        return [self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)]


class InferenceTransformerWrapper(nn.Module):
    def __init__(self, transformer: Transformer2DModel):
        super().__init__()
        self.transformer = transformer
        self.config = transformer.config
        self.dtype = transformer.dtype
        self.device = transformer.device

    def forward(self, hidden_states, encoder_hidden_states=None, timestep=None,
                encoder_attention_mask=None, added_cond_kwargs=None, return_dict=False):
        output = self.transformer(
            hidden_states, encoder_hidden_states, timestep, encoder_attention_mask)
        return output


class SimpleWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        output = self.model(x)
        return output


# --- Load all compiled models and benchmark pipeline ---
def get_pipe(resolution, dtype):
    if resolution == 256:
        transformer = Transformer2DModel.from_pretrained(
            "PixArt-alpha/PixArt-Sigma-XL-2-256x256",
            subfolder='transformer',
            torch_dtype=dtype,
        )
        return PixArtSigmaPipeline.from_pretrained(
            "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
            transformer=transformer,
            torch_dtype=dtype,
        )
    elif resolution == 512:
        transformer = Transformer2DModel.from_pretrained(
            "PixArt-alpha/PixArt-Sigma-XL-2-512-MS",
            subfolder='transformer',
            torch_dtype=dtype,
        )
        return PixArtSigmaPipeline.from_pretrained(
            "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
            transformer=transformer,
            torch_dtype=dtype,
        )
    else:
        raise Exception(f"Unsupported resolution {resolution} for PixArt Sigma")


COMPILER_WORKDIR_ROOT = 'pixart_sigma_compile_dir'
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
transformer_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'transformer/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

# Select the desired resolution (256 or 512)
resolution = 256
# resolution = 512
pipe = get_pipe(resolution, DTYPE)

seqlen = 300
_neuronTextEncoder = InferenceTextEncoderWrapper(DTYPE, pipe.text_encoder, seqlen)
_neuronTextEncoder.t = torch.jit.load(text_encoder_filename)
pipe.text_encoder = _neuronTextEncoder

assert pipe._execution_device is not None

device_ids = [0, 1]
_neuronTransformer = InferenceTransformerWrapper(pipe.transformer)
_neuronTransformer.transformer = torch_neuronx.DataParallel(
    torch.jit.load(transformer_filename), device_ids, set_dynamic_batching=False)
pipe.transformer = _neuronTransformer

pipe.vae.decoder = SimpleWrapper(torch.jit.load(decoder_filename))
pipe.vae.post_quant_conv = SimpleWrapper(torch.jit.load(post_quant_conv_filename))

prompt = "a photo of an astronaut riding a horse on mars"
n_runs = 20
benchmark(n_runs, "pixart_sigma", pipe, prompt)


================================================
FILE: archive/src/benchmark/pytorch/resnet50_benchmark.py
================================================
import torch
import torch.neuron

import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_name = "resnet50"
batch_sizes = [1, 6]


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    inputs = [get_batch(batch_size) for batch_size in batch_sizes]
    filename = f"{model_name}.json"

    # Benchmark
    print("Benchmarking {}".format(filename))
    reports = npf.torch.benchmark(filename, inputs)

    # View and save results
    print("======== {} ========".format(filename))
    npf.print_reports(reports)
    npf.write_csv(reports)
    npf.write_json(reports)


================================================
FILE: archive/src/benchmark/pytorch/resnet50_compile.py
================================================
import torch
import torch.neuron
import torchvision

import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_name = "resnet50"
batch_sizes = [1, 6]
pipeline_sizes = [1]


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    model = torchvision.models.resnet50(pretrained=True)
    inputs = [get_batch(batch_size) for batch_size in batch_sizes]
    filename = f"{model_name}.json"

    # Compile
    print("Compiling {}".format(filename))
    npf.torch.compile(
        model,
        inputs,
        batch_sizes=batch_sizes,
        pipeline_sizes=pipeline_sizes,
        filename=filename,
        model_name=model_name,
    )


================================================
FILE: archive/src/benchmark/pytorch/resnet_benchmark.py
================================================
import torch

import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_names = ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"]
batch_sizes = [1, 8, 64]
n_models = [1, 2]
workers_per_model = [1, 2]  # optimized for latency or throughput


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    for model_name in model_names:
        inputs = [get_batch(batch_size) for batch_size in batch_sizes]
        filename = f"{model_name}.json"

        # Benchmark
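        # (n_models and workers_per_model are lists, so each combination is
        # benchmarked as its own configuration and surfaces as a separate
        # entry in `reports`, hence the plural print_reports/write_csv below.)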
print("Benchmarking {}".format(filename)) reports = npf.torch.benchmark(filename, inputs, n_models=n_models, workers_per_model=workers_per_model) # View and save results print("======== {} ========".format(filename)) npf.print_reports(reports) npf.write_csv(reports) npf.write_json(reports) ================================================ FILE: archive/src/benchmark/pytorch/resnet_compile.py ================================================ import torch import torchvision import neuronperf as npf import neuronperf.torch # Add to these lists or change as needed model_names = ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"] batch_sizes = [1, 8, 64] pipeline_sizes = [1] def get_batch(batch_size): return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32) if __name__ == "__main__": for model_name in model_names: model = getattr(torchvision.models, model_name)(pretrained=True) inputs = [get_batch(batch_size) for batch_size in batch_sizes] filename = f"{model_name}.json" # Compile print("Compiling {}".format(filename)) npf.torch.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: archive/src/benchmark/pytorch/sd2_512_benchmark.py ================================================ import os os.environ["NEURON_FUSE_SOFTMAX"] = "1" import torch import torch.nn as nn import torch_neuronx from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler from diffusers.models.unet_2d_condition import UNet2DConditionOutput import time import math # Define datatype DTYPE = torch.bfloat16 # Specialized benchmarking class for stable diffusion. # We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance, # because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object. # All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a # traced Torchscript. 
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)


class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]


class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple


class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]


def decode_latents(self, latents):
    latents = latents.to(torch.float)
    latents = 1 / self.vae.config.scaling_factor * latents
    image = self.vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return image


StableDiffusionPipeline.decode_latents = decode_latents

# --- Load all compiled models and benchmark pipeline ---
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
model_id = "stabilityai/stable-diffusion-2-1-base"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0, 1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)


class NeuronTypeConversionWrapper(nn.Module):
    def __init__(self, network):
        super().__init__()
        self.network = network

    def forward(self, x):
        return self.network(x.float())


# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = NeuronTypeConversionWrapper(torch.jit.load(decoder_filename))
pipe.vae.post_quant_conv = NeuronTypeConversionWrapper(torch.jit.load(post_quant_conv_filename))

prompt = "a photo of an astronaut riding a horse on mars"
n_runs = 20
benchmark(n_runs, "stable_diffusion_512", pipe, prompt)


================================================
FILE: archive/src/benchmark/pytorch/sd2_512_compile.py
================================================
import os

os.environ["NEURON_FUSE_SOFTMAX"] = "1"

import torch
import torch.nn as nn
import torch_neuronx
import copy
from diffusers import StableDiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

# Compatibility for diffusers<0.18.0
from packaging import version
import diffusers

diffusers_version = version.parse(diffusers.__version__)
use_new_diffusers = diffusers_version >= version.parse('0.18.0')
if use_new_diffusers:
    from diffusers.models.attention_processor import Attention
else:
    from diffusers.models.cross_attention import CrossAttention

# Define datatype
DTYPE = torch.bfloat16


# Have to do this double wrapper trick to compile the unet, because
# of the special UNet2DConditionOutput output type.
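# The same pattern in miniature (hypothetical names, illustration only):
# UNetWrap flattens the diffusers return type into a trace-friendly tuple, and
# NeuronUNet restores UNet2DConditionOutput around the traced module so the
# pipeline keeps seeing the interface it expects:
#
#   class TupleWrap(nn.Module):
#       def __init__(self, m):
#           super().__init__()
#           self.m = m
#       def forward(self, x):
#           return (self.m(x).sample,)  # plain tuple out, safe to trace
#
#   class TypedUnwrap(nn.Module):
#       def __init__(self, traced):
#           super().__init__()
#           self.traced = traced
#       def forward(self, x):
#           return UNet2DConditionOutput(sample=self.traced(x)[0])  # type restored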
class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple


class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]


# Optimized attention
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(
            key,
            query.transpose(-1, -2)
        )

        if self.upcast_softmax:
            attention_scores = attention_scores.float()

        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = custom_badbmm(
            query,
            key.transpose(-1, -2)
        )

        if self.upcast_softmax:
            attention_scores = attention_scores.float()

        attention_probs = attention_scores.softmax(dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs


# In the original badbmm the bias is all zeros, so only apply scale
def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled


# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'

# Model ID for SD version pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"

# --- Compile UNet and save ---
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)

# Replace original cross-attention module with custom cross-attention module for better performance
if use_new_diffusers:
    Attention.get_attention_scores = get_attention_scores
else:
    CrossAttention.get_attention_scores = get_attention_scores

# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe

# Compile unet
sample_1b = torch.randn([1, 4, 64, 64], dtype=DTYPE)
timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b

unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference", "--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)

# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)

# delete unused objects
del unet
del unet_neuron

# --- Compile CLIP text encoder and save ---
# Only keep the model being compiled
# delete unused objects
del unet
del unet_neuron

# --- Compile CLIP text encoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)

# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407] + [0] * 72])  # padded to the 77-token context

text_encoder_neuron = torch_neuronx.trace(
    text_encoder.neuron_text_encoder,
    emb,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(text_encoder_neuron)

# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

# delete unused objects
del text_encoder
del text_encoder_neuron

# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder
del decoder_neuron

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv
del post_quant_conv_neuron
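# Editor's sketch (illustrative addition, not part of the archived script): a quick
# check that all four artifacts written by this script deserialize before moving on
# to the companion benchmark script.
import os
import torch
for name in ('unet', 'text_encoder', 'vae_decoder', 'vae_post_quant_conv'):
    path = os.path.join('sd2_compile_dir_512', name, 'model.pt')
    torch.jit.load(path)  # raises if the artifact is missing or corrupt
    print('loaded', path)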
================================================
FILE: archive/src/benchmark/pytorch/sd2_768_benchmark.py
================================================
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"

import torch
import torch.nn as nn
import torch_neuronx

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

import time
import math

# Define datatype
DTYPE = torch.float32

# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)

class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]

class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]
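# Editor's sketch (illustrative addition): benchmark() accepts any callable, so the
# harness can be sanity-checked without Neuron hardware using a toy "model". The
# helper name _toy_model is hypothetical.
def _toy_model(x):
    time.sleep(0.01)
    return x
benchmark(5, "toy", _toy_model, (42,))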
# --- Load all compiled models and run pipeline ---
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_768'
model_id = "stabilityai/stable-diffusion-2-1"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0, 1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

prompt = "a photo of an astronaut riding a horse on mars"
n_runs = 20
benchmark(n_runs, "stable_diffusion_768", pipe, prompt)

================================================
FILE: archive/src/benchmark/pytorch/sd2_768_compile.py
================================================
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"

import torch
import torch.nn as nn
import torch_neuronx
import copy

from diffusers import StableDiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

# Compatibility for diffusers<0.18.0
from packaging import version
import diffusers
diffusers_version = version.parse(diffusers.__version__)
use_new_diffusers = diffusers_version >= version.parse('0.18.0')
if use_new_diffusers:
    from diffusers.models.attention_processor import Attention
else:
    from diffusers.models.cross_attention import CrossAttention

# Define datatype
DTYPE = torch.float32

class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

# Optimized attention
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(key, query.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = custom_badbmm(query, key.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = attention_scores.softmax(dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs

# In the original baddbmm the bias is all zeros, so only apply scale
def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled
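# Editor's sketch (illustrative addition): custom_badbmm is equivalent to
# torch.baddbmm with a zero bias, beta=0 and alpha=0.125, which is what the
# comment above alludes to.
_a = torch.randn(2, 4, 8)
_b = torch.randn(2, 8, 4)
_zero_bias = torch.zeros(2, 4, 4)
assert torch.allclose(custom_badbmm(_a, _b),
                      torch.baddbmm(_zero_bias, _a, _b, beta=0.0, alpha=0.125),
                      atol=1e-6)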
# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_768'

# Model ID for SD version pipeline
model_id = "stabilityai/stable-diffusion-2-1"

# --- Compile UNet and save ---
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)

# Replace original cross-attention module with custom cross-attention module for better performance
if use_new_diffusers:
    Attention.get_attention_scores = get_attention_scores
else:
    CrossAttention.get_attention_scores = get_attention_scores

# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe

# Compile unet
sample_1b = torch.randn([1, 4, 96, 96], dtype=DTYPE)
timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b

unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference", "--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)

# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)

# delete unused objects
del unet
del unet_neuron

# --- Compile CLIP text encoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)

# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407] + [0] * 72])  # padded to the 77-token context

text_encoder_neuron = torch_neuronx.trace(
    text_encoder.neuron_text_encoder,
    emb,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(text_encoder_neuron)

# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

# delete unused objects
del text_encoder
del text_encoder_neuron

# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 96, 96], dtype=DTYPE)
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder
del decoder_neuron

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 96, 96], dtype=DTYPE)
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv
del post_quant_conv_neuron
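# Editor's note (illustrative): the 96x96 latent shapes used throughout this script
# follow from the 768px output resolution and the VAE's 8x spatial downscaling
# (768 // 8 == 96); the 512px scripts above use 64x64 latents for the same reason.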
================================================
FILE: archive/src/benchmark/pytorch/sd2_inpainting_benchmark.py
================================================
import torch
import torch.nn as nn
import torch_neuronx
import os

from diffusers import StableDiffusionInpaintPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.attention_processor import Attention

import argparse
import copy

torch.manual_seed(0)

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--prompt', type=str, default='Face of a yellow cat, high resolution, sitting on a park bench',
                        help="user input for text to image use case")
    parser.add_argument('--target_dir', type=str, default='./sd21_inpainting_512_neuron',
                        help="directory to save neuron compiled model")
    args = parser.parse_args()
    return args

# Have to do this double wrapper trick to compile the unet, because
# of the special UNet2DConditionOutput output type.
class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        sample = self.unetwrap(sample, timestep.bfloat16().expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

# Optimized attention
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(key, query.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = custom_badbmm(query, key.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs

def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled

inputs = parse_arguments()
print(inputs.target_dir)

# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = inputs.target_dir

def trace_vae_encoder(model_id, height, width):
    # Only keep the model being compiled in RAM to minimize memory pressure
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
    vae_encoder = copy.deepcopy(pipe.vae.encoder)
    del pipe

    sample_input = torch.randn([1, 3, height, width])
    vae_encoder_neuron = torch_neuronx.trace(
        vae_encoder,
        sample_input,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_encoder'),
    )

    # Save the compiled vae encoder
    vae_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_encoder/model.pt')
    torch.jit.save(vae_encoder_neuron, vae_encoder_filename)

    # delete unused objects
    del vae_encoder
    del vae_encoder_neuron

def trace_unet(model_id, height, width):
    # --- Compile UNet and save ---
    DTYPE = torch.bfloat16
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=DTYPE)

    # Replace original cross-attention module with custom cross-attention module for better performance
    Attention.get_attention_scores = get_attention_scores

    # Apply double wrapper to deal with custom return type
    pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

    # Only keep the model being compiled in RAM to minimize memory pressure
    unet = copy.deepcopy(pipe.unet.unetwrap)
    del pipe

    sample_1b = torch.randn([1, 9, height, width], dtype=DTYPE)
    timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
    encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)
    example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b
    unet_neuron = torch_neuronx.trace(
        unet,
        example_inputs,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
        compiler_args=["--model-type=unet-inference", "--verbose=info"],
    )

    # save compiled unet
    unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
    torch.jit.save(unet_neuron, unet_filename)

    # delete unused objects
    del unet
    del unet_neuron

def main():
    model_id = "stabilityai/stable-diffusion-2-inpainting"
    height = 624
    width = 936

    trace_unet(model_id, height // 8, width // 8)
    trace_vae_encoder(model_id, height, width)

    # Only keep the model being compiled in RAM to minimize memory pressure
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
    text_encoder = copy.deepcopy(pipe.text_encoder)
    del pipe

    # Apply the wrapper to deal with custom return type
    text_encoder = NeuronTextEncoder(text_encoder)

    # Compile text encoder
    # This is used for indexing a lookup table in torch.nn.Embedding,
    # so using random numbers may give errors (out of range).
    emb = torch.tensor([[49406, 18376, 525, 7496, 49407] + [0] * 72])  # padded to the 77-token context
    text_encoder_neuron = torch_neuronx.trace(
        text_encoder.neuron_text_encoder,
        emb,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
    )

    # Save the compiled text encoder
    text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
    torch.jit.save(text_encoder_neuron, text_encoder_filename)

    # delete unused objects
    del text_encoder
    del text_encoder_neuron

    # --- Compile VAE decoder and save ---

    # Only keep the model being compiled in RAM to minimize memory pressure
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
    decoder = copy.deepcopy(pipe.vae.decoder)
    del pipe

    # Compile vae decoder
    decoder_in = torch.randn([1, 4, height // 8, width // 8])
    decoder_neuron = torch_neuronx.trace(
        decoder,
        decoder_in,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
        compiler_args=["--verbose", "info"]
    )

    # Save the compiled vae decoder
    decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
    torch.jit.save(decoder_neuron, decoder_filename)

    # delete unused objects
    del decoder
    del decoder_neuron

    # --- Compile VAE post_quant_conv and save ---

    # Only keep the model being compiled in RAM to minimize memory pressure
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
    post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
    del pipe

    # Compile vae post_quant_conv
    post_quant_conv_in = torch.randn([1, 4, height // 8, width // 8])
    post_quant_conv_neuron = torch_neuronx.trace(
        post_quant_conv,
        post_quant_conv_in,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
        compiler_args=["--verbose", "info"]
    )

    # Save the compiled vae post_quant_conv
    post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
    torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

    # delete unused objects
    del post_quant_conv
    del post_quant_conv_neuron

if __name__ == "__main__":
    main()
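# Editor's sketch (illustrative addition, not part of the archive): the two
# inpainting scripts are intended to run in order; this compile script writes
# artifacts under --target_dir and the inference script below reads them back.
# The file names here assume the archive layout.
import subprocess
subprocess.run(["python", "sd2_inpainting_benchmark.py", "--target_dir", "./sd21_inpainting_512_neuron"], check=True)
subprocess.run(["python", "sd2_inpainting_inference.py", "--target_dir", "./sd21_inpainting_512_neuron"], check=True)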
================================================
FILE: archive/src/benchmark/pytorch/sd2_inpainting_inference.py
================================================
import torch
import torch.nn as nn
import torch_neuronx
import os
import time

from diffusers import StableDiffusionInpaintPipeline, DPMSolverMultistepScheduler
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.attention_processor import Attention

import threading
import argparse
import sys
import copy
import PIL
import math

torch.manual_seed(0)

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--prompt', type=str, default='Face of a yellow cat, high resolution, sitting on a park bench',
                        help="user input for text to image use case")
    parser.add_argument('--target_dir', type=str, default='./sd21_inpainting_512_neuron',
                        help="directory to save neuron compiled model")
    args = parser.parse_args()
    return args

# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)

class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]
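# Editor's note (illustrative worked example): percentile() uses a nearest-rank rule.
# For latencies [10, 20, 30, 40] ms, percentile(50) computes pos_float = 2.0, so it
# returns the element at index 2 of the sorted list, i.e. 30 ms, while percentile(100)
# clamps the index to the last position and returns the maximum, 40 ms.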
DTYPE = torch.bfloat16

# Have to do this double wrapper trick to compile the unet, because
# of the special UNet2DConditionOutput output type.
class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, timestep_cond=None, added_cond_kwargs=None, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(sample.to(dtype=DTYPE), timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states.to(dtype=DTYPE))[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

# Optimized attention
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(key, query.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = custom_badbmm(query, key.transpose(-1, -2))
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs

def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled

def main():
    inputs = parse_arguments()
    print(inputs.target_dir)

    # For saving compiler artifacts
    COMPILER_WORKDIR_ROOT = inputs.target_dir
    model_id = "stabilityai/stable-diffusion-2-inpainting"

    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, torch_dtype=torch.float32)

    text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
    unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
    vae_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_encoder/model.pt')
    decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
    post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

    # Load the compiled UNet onto two neuron cores.
    pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
    device_ids = [0, 1]
    pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
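    # Editor's note (illustrative): DataParallel replicates the traced UNet across
    # NeuronCores 0 and 1, splitting inputs on the batch dimension; with
    # classifier-free guidance the UNet sees batch size 2 per step, so each core
    # handles one sample of the batch in parallel.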
    # Load other compiled models onto a single neuron core.
    pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
    pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
    pipe.vae.encoder = torch.jit.load(vae_encoder_filename)
    pipe.vae.decoder = torch.jit.load(decoder_filename)
    pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

    height = 624
    width = 936
    base_image = PIL.Image.open('sd2_inpainting_photo.png')
    mask = PIL.Image.open('sd2_inpainting_mask.png')

    image = pipe(prompt=inputs.prompt, image=base_image, mask_image=mask, height=height, width=width).images[0]
    image.save("sd2_inpainting_output.png")

    n_runs = 10
    benchmark(n_runs, "stable_diffusion_inpainting", pipe, (inputs.prompt, base_image, mask, None, height, width))

if __name__ == "__main__":
    main()
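# Editor's sketch (illustrative addition, not part of the archived script): the
# inference script expects sd2_inpainting_photo.png and sd2_inpainting_mask.png in
# the working directory. For a dry run, same-size placeholders can be created first.
import PIL.Image
PIL.Image.new("RGB", (936, 624), "gray").save("sd2_inpainting_photo.png")
PIL.Image.new("L", (936, 624), 255).save("sd2_inpainting_mask.png")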
================================================
FILE: archive/src/benchmark/pytorch/sd_15_512_benchmark.py
================================================
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"

import copy
import time
import torch
import torch.nn as nn
import torch_neuronx

from diffusers import StableDiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

import math

# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)

class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]

class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(sample, timestep.float().expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = torch.float32
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

class NeuronSafetyModelWrap(nn.Module):
    def __init__(self, safety_model):
        super().__init__()
        self.safety_model = safety_model

    def forward(self, clip_inputs):
        return list(self.safety_model(clip_inputs).values())
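# Editor's note (illustrative): NeuronSafetyModelWrap appears to adapt the traced
# safety checker, whose TorchScript forward returns a dict of tensors, back to the
# positional list of outputs the pipeline's safety checker expects to unpack.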
# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd_1_5_fp32_512_compile_workdir'

# Model ID for SD version pipeline
model_id = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
safety_model_neuron_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'safety_model/model.pt')

# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0, 1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)
pipe.safety_checker.vision_model = NeuronSafetyModelWrap(torch.jit.load(safety_model_neuron_filename))

prompt = "a photo of an astronaut riding a horse on mars"
n_runs = 20
benchmark(n_runs, "stable_diffusion_15_512", pipe, prompt)

================================================
FILE: archive/src/benchmark/pytorch/sd_15_512_compile.py
================================================
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"

import copy
import time
import torch
import torch.nn as nn
import torch_neuronx

from diffusers import StableDiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

# Compatibility for diffusers<0.18.0
from packaging import version
import diffusers
diffusers_version = version.parse(diffusers.__version__)
use_new_diffusers = diffusers_version >= version.parse('0.18.0')
if use_new_diffusers:
    from diffusers.models.attention_processor import Attention
else:
    from diffusers.models.cross_attention import CrossAttention

def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    if query.size() == key.size():
        attention_scores = cust_badbmm(key, query.transpose(-1, -2), self.scale)
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = cust_badbmm(query, key.transpose(-1, -2), self.scale)
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = torch.nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs

def cust_badbmm(a, b, scale):
    bmm = torch.bmm(a, b)
    scaled = bmm * scale
    return scaled

class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(sample, timestep.float().expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = torch.float32
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

class NeuronSafetyModelWrap(nn.Module):
    def __init__(self, safety_model):
        super().__init__()
        self.safety_model = safety_model

    def forward(self, clip_inputs):
        return list(self.safety_model(clip_inputs).values())
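# Editor's note (illustrative): unlike the SD2 scripts above, cust_badbmm takes the
# attention module's own scale rather than a hard-coded factor; for 64-dimensional
# attention heads that scale works out to the 0.125 used earlier.
import math
assert 1 / math.sqrt(64) == 0.125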
# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd_1_5_fp32_512_compile_workdir'

# Model ID for SD version pipeline
model_id = "runwayml/stable-diffusion-v1-5"

# --- Compile CLIP text encoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)

# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407] + [0] * 72])  # padded to the 77-token context

with torch.no_grad():
    start_time = time.time()
    text_encoder_neuron = torch_neuronx.trace(
        text_encoder.neuron_text_encoder,
        emb,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
        compiler_args=["--enable-fast-loading-neuron-binaries"]
    )
    text_encoder_neuron_compile_time = time.time() - start_time
print('text_encoder_neuron_compile_time:', text_encoder_neuron_compile_time)

# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch_neuronx.async_load(text_encoder_neuron)
torch.jit.save(text_encoder_neuron, text_encoder_filename)

# delete unused objects
del text_encoder
del text_encoder_neuron
del emb

# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 64, 64])
with torch.no_grad():
    start_time = time.time()
    decoder_neuron = torch_neuronx.trace(
        decoder,
        decoder_in,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
        compiler_args=["--enable-fast-loading-neuron-binaries"]
    )
    vae_decoder_compile_time = time.time() - start_time
print('vae_decoder_compile_time:', vae_decoder_compile_time)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch_neuronx.async_load(decoder_neuron)
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder
del decoder_in
del decoder_neuron

# --- Compile UNet and save ---
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)

# Replace original cross-attention module with custom cross-attention module for better performance
if use_new_diffusers:
    Attention.get_attention_scores = get_attention_scores
else:
    CrossAttention.get_attention_scores = get_attention_scores

# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe

# Compile unet - FP32
sample_1b = torch.randn([1, 4, 64, 64])
timestep_1b = torch.tensor(999).float().expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 768])
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b

with torch.no_grad():
    start_time = time.time()
    unet_neuron = torch_neuronx.trace(
        unet,
        example_inputs,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
        compiler_args=["--model-type=unet-inference", "--enable-fast-loading-neuron-binaries"]
    )
    unet_compile_time = time.time() - start_time
print('unet_compile_time:', unet_compile_time)
# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)

# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)

# delete unused objects
del unet
del unet_neuron
del sample_1b
del timestep_1b
del encoder_hidden_states_1b

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 64, 64])
with torch.no_grad():
    start_time = time.time()
    post_quant_conv_neuron = torch_neuronx.trace(
        post_quant_conv,
        post_quant_conv_in,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
        compiler_args=["--enable-fast-loading-neuron-binaries"]
    )
    vae_post_quant_conv_compile_time = time.time() - start_time
print('vae_post_quant_conv_compile_time:', vae_post_quant_conv_compile_time)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch_neuronx.async_load(post_quant_conv_neuron)
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv

# --- Compile safety checker and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
safety_model = copy.deepcopy(pipe.safety_checker.vision_model)
del pipe

clip_input = torch.randn([1, 3, 224, 224])
with torch.no_grad():
    start_time = time.time()
    safety_model = torch_neuronx.trace(
        safety_model,
        clip_input,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'safety_model'),
        compiler_args=["--enable-fast-loading-neuron-binaries"]
    )
    safety_model_compile_time = time.time() - start_time
print('safety_model_compile_time:', safety_model_compile_time)

# Save the compiled safety checker
safety_model_neuron_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'safety_model/model.pt')
torch_neuronx.async_load(safety_model)
torch.jit.save(safety_model, safety_model_neuron_filename)

# delete unused objects
del safety_model

print('Total compile time:', text_encoder_neuron_compile_time + vae_decoder_compile_time
      + unet_compile_time + vae_post_quant_conv_compile_time + safety_model_compile_time)
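# Editor's sketch (hypothetical helper, not in the archived script): the repeated
# time-the-trace pattern in this file could be factored into a single function.
import time
import torch_neuronx

def timed_trace(name, model, example_inputs, **kwargs):
    start = time.time()
    traced = torch_neuronx.trace(model, example_inputs, **kwargs)
    print(f'{name}_compile_time:', time.time() - start)
    return traced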
================================================
FILE: archive/src/benchmark/pytorch/sd_4x_upscaler_benchmark.py
================================================
import os
import time
import requests
import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_neuronx
import numpy as np
from PIL import Image
from io import BytesIO

import diffusers
from diffusers import StableDiffusionUpscalePipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput

class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, class_labels, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, class_labels, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, class_labels, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(
            sample,
            timestep.float().expand((sample.shape[0],)),
            encoder_hidden_states,
            class_labels,
        )[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)["last_hidden_state"]]

# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)

class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]

# --- Load all compiled models ---
COMPILER_WORKDIR_ROOT = 'stable_diffusion_upscaler_fp32'
model_id = "stabilityai/stable-diffusion-x4-upscaler"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float32)
# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0, 1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

# Run pipeline
prompt = ["a white cat"]
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

upscaled_image = pipe(prompt=prompt, image=low_res_img).images[0]
os.makedirs("misc", exist_ok=True)
upscaled_image.save("upsampled_cat.png")

# Benchmark
n_runs = 20
benchmark(n_runs, "stable_diffusion_512", pipe, (prompt, low_res_img))

================================================
FILE: archive/src/benchmark/pytorch/sd_4x_upscaler_compile.py
================================================
import os
import requests
import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_neuronx
from PIL import Image
from io import BytesIO

import diffusers
from diffusers import StableDiffusionUpscalePipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from packaging import version

def apply_neuron_attn_override(diffusers_pkg, get_attn_scores_func, neuron_scaled_dot_product_attention):
    diffusers_version = version.parse(diffusers_pkg.__version__)
    use_new_diffusers = diffusers_version >= version.parse("0.18.0")
    if use_new_diffusers:
        diffusers_pkg.models.attention_processor.Attention.get_attention_scores = get_attn_scores_func
    else:
        diffusers_pkg.models.cross_attention.CrossAttention.get_attention_scores = get_attn_scores_func

    # If PyTorch 2 is available, F.scaled_dot_product_attention will be used, so we need
    # to monkey-patch that too with the Neuron-optimized attention
    if hasattr(F, "scaled_dot_product_attention"):
        F.scaled_dot_product_attention = neuron_scaled_dot_product_attention

def get_attention_scores_neuron(self, query, key, attn_mask):
    if query.size() == key.size():
        attention_scores = cust_badbmm(key, query.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
    else:
        attention_scores = cust_badbmm(query, key.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=-1)

    return attention_probs

def cust_badbmm(a, b, scale):
    bmm = torch.bmm(a, b)
    scaled = bmm * scale
    return scaled

def neuron_scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=None, is_causal=None):
    orig_shape = None
    if len(query.shape) == 4:
        orig_shape = query.shape

        def to3d(x):
            return x.reshape(-1, x.shape[2], x.shape[3])

        query, key, value = map(to3d, [query, key, value])

    if query.size() == key.size():
        attention_scores = torch.bmm(key, query.transpose(-1, -2)) * (1 / math.sqrt(query.size(-1)))
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
    else:
        attention_scores = torch.bmm(query, key.transpose(-1, -2)) * (1 / math.sqrt(query.size(-1)))
        attention_probs = attention_scores.softmax(dim=-1)

    attn_out = torch.bmm(attention_probs, value)

    if orig_shape:
        attn_out = attn_out.reshape(orig_shape[0], orig_shape[1], attn_out.shape[1], attn_out.shape[2])

    return attn_out
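# Editor's sketch (illustrative addition): a quick numerical check, assuming
# PyTorch >= 2.0, that the Neuron-friendly replacement matches
# F.scaled_dot_product_attention on the cross-attention (non-square) path with no
# mask and no dropout. Note: run this before apply_neuron_attn_override patches F.
_q = torch.randn(2, 3, 8, 16)
_k = torch.randn(2, 3, 4, 16)
_v = torch.randn(2, 3, 4, 16)
_ref = F.scaled_dot_product_attention(_q, _k, _v)
_out = neuron_scaled_dot_product_attention(_q, _k, _v)
print(torch.allclose(_ref, _out, atol=1e-5))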
class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, class_labels, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, class_labels, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, class_labels, cross_attention_kwargs=None, return_dict=False):
        sample = self.unetwrap(
            sample,
            timestep.float().expand((sample.shape[0],)),
            encoder_hidden_states,
            class_labels,
        )[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)["last_hidden_state"]]

# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'stable_diffusion_upscaler_fp32'

# Model ID for SD version pipeline
model_id = "stabilityai/stable-diffusion-x4-upscaler"

# --- Compile CLIP text encoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float32)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)

# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407] + [0] * 72])  # padded to the 77-token context
text_encoder_neuron = torch_neuronx.trace(
    text_encoder.neuron_text_encoder,
    emb,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
)

# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

# delete unused objects
del text_encoder

# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 128, 128])
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder

# --- Compile UNet and save ---
pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float32)

# Replace original cross-attention module with custom cross-attention module for better performance
apply_neuron_attn_override(diffusers, get_attention_scores_neuron, neuron_scaled_dot_product_attention)

# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe

# Compile unet - FP32
sample_1b = torch.randn([1, 7, 128, 128])
timestep_1b = torch.tensor(999).float().expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 1024])
class_labels = torch.tensor([20])
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b, class_labels

unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference"]
)

# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)

# delete unused objects
del unet

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float32)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 128, 128])
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv
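# Editor's note (illustrative): the extra class_labels input traced above carries the
# x4 upscaler's noise-level conditioning, which is why the example input is a small
# integer tensor (torch.tensor([20])) rather than random data.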
================================================
FILE: archive/src/benchmark/pytorch/sdxl_base_1024_benchmark.py
================================================
import os
import torch
import torch.nn as nn
import torch_neuronx
from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from transformers.models.clip.modeling_clip import CLIPTextModelOutput
import time
import math

# Define datatype
DTYPE = torch.float32


# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because
    # StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)


class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]


class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, text_embeds=None, time_ids=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states,
                              added_cond_kwargs={"text_embeds": text_embeds, "time_ids": time_ids},
                              return_dict=False)
        return out_tuple


class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.add_embedding = unetwrap.unet.add_embedding
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, added_cond_kwargs=None, return_dict=False, cross_attention_kwargs=None):
        sample = self.unetwrap(sample,
                               timestep.to(dtype=DTYPE).expand((sample.shape[0],)),
                               encoder_hidden_states,
                               added_cond_kwargs["text_embeds"],
                               added_cond_kwargs["time_ids"])[0]
        return UNet2DConditionOutput(sample=sample)
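# Note on the pattern above (it recurs in all the SD samples in this
# directory): UNetWrap flattens the diffusers keyword-argument interface into
# positional tensor arguments so the module can be traced, and NeuronUNet
# restores the interface the pipeline expects. Expanding the scalar timestep
# to shape (batch_size,) gives the traced graph a fixed-shape tensor input; a
# sketch of the equivalent standalone transformation:
#
#     t = torch.tensor(999).to(dtype=DTYPE).expand((1,))  # shape (1,)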
class TextEncoderOutputWrapper(nn.Module):
    def __init__(self, traceable_text_encoder, original_text_encoder):
        super().__init__()
        self.traceable_text_encoder = traceable_text_encoder
        self.config = original_text_encoder.config
        self.dtype = original_text_encoder.dtype
        self.device = original_text_encoder.device

    def forward(self, text_input_ids, output_hidden_states=True):
        out_tuple = self.traceable_text_encoder(text_input_ids)
        return CLIPTextModelOutput(text_embeds=out_tuple[0],
                                   last_hidden_state=out_tuple[1],
                                   hidden_states=out_tuple[2])


class TraceableTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.text_encoder = text_encoder

    def forward(self, text_input_ids):
        out_tuple = self.text_encoder(text_input_ids, output_hidden_states=True, return_dict=False)
        return out_tuple


# --- Load all compiled models and run pipeline ---
COMPILER_WORKDIR_ROOT = 'sdxl_base_compile_dir_1024'
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)

# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0, 1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single neuron core.
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)
pipe.text_encoder = TextEncoderOutputWrapper(torch.jit.load(text_encoder_filename), pipe.text_encoder)
pipe.text_encoder_2 = TextEncoderOutputWrapper(torch.jit.load(text_encoder_2_filename), pipe.text_encoder_2)

prompt = "a photo of an astronaut riding a horse on mars"
n_runs = 20
benchmark(n_runs, "stable_diffusion_1024", pipe, prompt)


================================================
FILE: archive/src/benchmark/pytorch/sdxl_base_1024_compile.py
================================================
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_neuronx
import math
import copy
import diffusers
from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.attention_processor import Attention
from transformers.models.clip.modeling_clip import CLIPTextModelOutput
from packaging import version


def apply_neuron_attn_override(diffusers_pkg, get_attn_scores_func, neuron_scaled_dot_product_attention):
    diffusers_version = version.parse(diffusers_pkg.__version__)
    use_new_diffusers = diffusers_version >= version.parse("0.18.0")
    if use_new_diffusers:
        diffusers_pkg.models.attention_processor.Attention.get_attention_scores = get_attn_scores_func
    else:
        diffusers_pkg.models.cross_attention.CrossAttention.get_attention_scores = get_attn_scores_func

    # If PyTorch 2 is available, F.scaled_dot_product_attention will be used, so we need to
    # monkey patch that too to be Neuron optimized attention
    if hasattr(F, "scaled_dot_product_attention"):
        F.scaled_dot_product_attention = neuron_scaled_dot_product_attention


# Define datatype
DTYPE = torch.float32


# Optimized attention
def get_attention_scores_neuron(self, query, key, attn_mask):
    if query.size() == key.size():
        attention_scores = custom_badbmm(key, query.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
    else:
        attention_scores = custom_badbmm(query, key.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=-1)
    return attention_probs


def custom_badbmm(a, b, scale):
    bmm = torch.bmm(a, b)
    scaled = bmm * scale
    return scaled


def neuron_scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=None, is_causal=None):
    orig_shape = None
    if len(query.shape) == 4:
        orig_shape = query.shape

        def to3d(x):
            return x.reshape(-1, x.shape[2], x.shape[3])

        query, key, value = map(to3d, [query, key, value])
    if query.size() == key.size():
        attention_scores = torch.bmm(key, query.transpose(-1, -2)) * (1 / math.sqrt(query.size(-1)))
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
    else:
        attention_scores = torch.bmm(query, key.transpose(-1, -2)) * (1 / math.sqrt(query.size(-1)))
        attention_probs = attention_scores.softmax(dim=-1)
    attn_out = torch.bmm(attention_probs, value)
    if orig_shape:
        attn_out = attn_out.reshape(orig_shape[0], orig_shape[1], attn_out.shape[1], attn_out.shape[2])
    return attn_out


# Replace original cross-attention module with custom cross-attention module for better performance
apply_neuron_attn_override(diffusers, get_attention_scores_neuron, neuron_scaled_dot_product_attention)
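# Note on the override above: diffusers moved this method from
# cross_attention.CrossAttention to attention_processor.Attention in v0.18.0,
# which is why apply_neuron_attn_override dispatches on the installed version.
# Patching F.scaled_dot_product_attention is global to the process; a hedged
# sanity check (not in the original script) that the patch took effect:
#
#     assert F.scaled_dot_product_attention is neuron_scaled_dot_product_attention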
added_cond_kwargs={"text_embeds": text_embeds, "time_ids": time_ids}, return_dict=False, ) return out_tuple class NeuronUNet(nn.Module): def __init__(self, unetwrap): super().__init__() self.unetwrap = unetwrap self.config = unetwrap.unet.config self.in_channels = unetwrap.unet.in_channels self.add_embedding = unetwrap.unet.add_embedding self.device = unetwrap.unet.device def forward( self, sample, timestep, encoder_hidden_states, added_cond_kwargs=None, return_dict=False, cross_attention_kwargs=None, ): sample = self.unetwrap( sample, timestep.float().expand((sample.shape[0],)), encoder_hidden_states, added_cond_kwargs["text_embeds"], added_cond_kwargs["time_ids"], )[0] return UNet2DConditionOutput(sample=sample) class TextEncoderOutputWrapper(nn.Module): def __init__(self, traceable_text_encoder, original_text_encoder): super().__init__() self.traceable_text_encoder = traceable_text_encoder self.config = original_text_encoder.config self.dtype = original_text_encoder.dtype self.device = original_text_encoder.device def forward(self, text_input_ids, output_hidden_states=True): out_tuple = self.traceable_text_encoder(text_input_ids) return CLIPTextModelOutput(text_embeds=out_tuple[0], last_hidden_state=out_tuple[1], hidden_states=out_tuple[2]) class TraceableTextEncoder(nn.Module): def __init__(self, text_encoder): super().__init__() self.text_encoder = text_encoder def forward(self, text_input_ids): out_tuple = self.text_encoder(text_input_ids, output_hidden_states=True, return_dict=False) return out_tuple # For saving compiler artifacts COMPILER_WORKDIR_ROOT = 'sdxl_base_compile_dir_1024' # Model ID for SD XL version pipeline model_id = "stabilityai/stable-diffusion-xl-base-1.0" # --- Compile Text Encoders and save --- pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE) # Apply wrappers to make text encoders traceable traceable_text_encoder = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder)) traceable_text_encoder_2 = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder_2)) del pipe text_input_ids_1 = torch.tensor([[49406, 736, 1615, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407]]) text_input_ids_2 = torch.tensor([[49406, 736, 1615, 49407, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]) # Text Encoder 1 neuron_text_encoder = torch_neuronx.trace( traceable_text_encoder, text_input_ids_1, compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'), ) text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt') torch.jit.save(neuron_text_encoder, text_encoder_filename) # Text Encoder 2 neuron_text_encoder_2 = torch_neuronx.trace( traceable_text_encoder_2, text_input_ids_2, compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2'), ) text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt') torch.jit.save(neuron_text_encoder_2, text_encoder_2_filename) # --- 
# --- Compile UNet and save ---
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)

# Replace original cross-attention module with custom cross-attention module for better performance
Attention.get_attention_scores = get_attention_scores_neuron

# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))

# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe

# Compile unet - FP32
sample_1b = torch.randn([1, 4, 128, 128], dtype=DTYPE)
timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 2048], dtype=DTYPE)
added_cond_kwargs_1b = {"text_embeds": torch.randn([1, 1280], dtype=DTYPE),
                        "time_ids": torch.randn([1, 6], dtype=DTYPE)}
example_inputs = (sample_1b, timestep_1b, encoder_hidden_states_1b,
                  added_cond_kwargs_1b["text_embeds"], added_cond_kwargs_1b["time_ids"],)

unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference"],
)

# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)

# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)

# delete unused objects
del unet
del unet_neuron

# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 128, 128], dtype=DTYPE)
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder
del decoder_neuron

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 128, 128], dtype=DTYPE)
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv
del post_quant_conv_neuron
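# A hedged note on the load-time helpers used above: torch_neuronx.async_load
# and torch_neuronx.lazy_load are applied to the traced module before
# torch.jit.save, so the behavior travels with the saved artifact; the
# matching benchmark script then only needs a plain load, e.g.
#
#     unet = torch.jit.load(unet_filename)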
================================================
FILE: archive/src/benchmark/pytorch/sdxl_base_and_refiner_1024_benchmark.py
================================================
import os
import torch
import torch.nn as nn
import torch_neuronx
from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
import time
import math

# Define datatype
DTYPE = torch.float32


# Specialized benchmarking class for stable diffusion.
# We cannot use any of the pre-existing benchmarking utilities to benchmark E2E stable diffusion performance,
# because the top-level StableDiffusionPipeline cannot be serialized into a single Torchscript object.
# All of the pre-existing benchmarking utilities (in neuronperf or torch_neuronx) require the model to be a
# traced Torchscript.
def benchmark(n_runs, test_name, model, model_inputs):
    if not isinstance(model_inputs, tuple):
        model_inputs = (model_inputs,)

    warmup_run = model(*model_inputs)

    latency_collector = LatencyCollector()
    # can't use register_forward_pre_hook or register_forward_hook because
    # StableDiffusionPipeline is not a torch.nn.Module
    for _ in range(n_runs):
        latency_collector.pre_hook()
        res = model(*model_inputs)
        latency_collector.hook()

    p0_latency_ms = latency_collector.percentile(0) * 1000
    p50_latency_ms = latency_collector.percentile(50) * 1000
    p90_latency_ms = latency_collector.percentile(90) * 1000
    p95_latency_ms = latency_collector.percentile(95) * 1000
    p99_latency_ms = latency_collector.percentile(99) * 1000
    p100_latency_ms = latency_collector.percentile(100) * 1000

    report_dict = dict()
    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
    report_dict["Latency P50"] = f'{p50_latency_ms:.1f}'
    report_dict["Latency P90"] = f'{p90_latency_ms:.1f}'
    report_dict["Latency P95"] = f'{p95_latency_ms:.1f}'
    report_dict["Latency P99"] = f'{p99_latency_ms:.1f}'
    report_dict["Latency P100"] = f'{p100_latency_ms:.1f}'

    report = f'RESULT FOR {test_name}:'
    for key, value in report_dict.items():
        report += f' {key}={value}'
    print(report)


class LatencyCollector:
    def __init__(self):
        self.start = None
        self.latency_list = []

    def pre_hook(self, *args):
        self.start = time.time()

    def hook(self, *args):
        self.latency_list.append(time.time() - self.start)

    def percentile(self, percent):
        latency_list = self.latency_list
        pos_float = len(latency_list) * percent / 100
        max_pos = len(latency_list) - 1
        pos_floor = min(math.floor(pos_float), max_pos)
        pos_ceil = min(math.ceil(pos_float), max_pos)
        latency_list = sorted(latency_list)
        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]


class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, text_embeds=None, time_ids=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states,
                              added_cond_kwargs={"text_embeds": text_embeds, "time_ids": time_ids},
                              return_dict=False)
        return out_tuple


class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.add_embedding = unetwrap.unet.add_embedding
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, added_cond_kwargs=None, return_dict=False, cross_attention_kwargs=None):
        sample = self.unetwrap(sample,
                               timestep.to(dtype=DTYPE).expand((sample.shape[0],)),
                               encoder_hidden_states,
                               added_cond_kwargs["text_embeds"],
                               added_cond_kwargs["time_ids"])[0]
        return UNet2DConditionOutput(sample=sample)


# Helper function to run both refiner and base pipes and return the final image
def run_refiner_and_base(base, refiner, prompt, n_steps=40, high_noise_frac=0.8, generator=None):
    # The base pipe denoises the first high_noise_frac of the schedule and returns latents
    image = base(
        prompt=prompt,
        num_inference_steps=n_steps,
        denoising_end=high_noise_frac,
        output_type="latent",
        generator=generator,
    ).images
    # The refiner resumes at denoising_start and produces the final image
    image = refiner(
        prompt=prompt,
        num_inference_steps=n_steps,
        denoising_start=high_noise_frac,
        image=image,
    ).images[0]
    return image


# --- Load all compiled models and run pipeline ---
COMPILER_WORKDIR_ROOT = 'sdxl_base_and_refiner_compile_dir_1024'
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
refiner_model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
unet_base_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet_base/model.pt')
unet_refiner_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet_refiner/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')

# ------- Load base -------
pipe_base = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True)

# Load the compiled UNet onto two neuron cores.
pipe_base.unet = NeuronUNet(UNetWrap(pipe_base.unet))
device_ids = [0, 1]
pipe_base.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_base_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single neuron core.
pipe_base.vae.decoder = torch.jit.load(decoder_filename)
pipe_base.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

# ------- Load refiner -------
# refiner shares text_encoder_2 and vae with the base
pipe_refiner = DiffusionPipeline.from_pretrained(
    refiner_model_id,
    text_encoder_2=pipe_base.text_encoder_2,
    vae=pipe_base.vae,
    torch_dtype=DTYPE,
    low_cpu_mem_usage=True,
)

# Refiner - load the compiled UNet onto two neuron cores.
pipe_refiner.unet = NeuronUNet(UNetWrap(pipe_refiner.unet))
device_ids = [0, 1]
pipe_refiner.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_refiner_filename), device_ids, set_dynamic_batching=False)

# Define how many steps and what % of steps to run on each expert (80/20) here
n_steps = 40
high_noise_frac = 0.8

prompt = "a photo of an astronaut riding a horse on mars"
inputs = (pipe_base, pipe_refiner, prompt, n_steps, high_noise_frac, torch.manual_seed(0),)
n_runs = 50
benchmark(n_runs, "stable_diffusion_1024", run_refiner_and_base, inputs)


================================================
FILE: archive/src/benchmark/pytorch/sdxl_base_and_refiner_1024_compile.py
================================================
import os
import torch
import torch.nn as nn
import torch_neuronx
import copy
from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.attention_processor import Attention

# Define datatype
DTYPE = torch.float32


# Optimized attention
def get_attention_scores_neuron(self, query, key, attn_mask):
    if query.size() == key.size():
        attention_scores = custom_badbmm(key, query.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
    else:
        attention_scores = custom_badbmm(query, key.transpose(-1, -2), self.scale)
        attention_probs = attention_scores.softmax(dim=-1)
    return attention_probs


def custom_badbmm(a, b, scale):
    bmm = torch.bmm(a, b)
    scaled = bmm * scale
    return scaled


class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, text_embeds=None, time_ids=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states,
                              added_cond_kwargs={"text_embeds": text_embeds, "time_ids": time_ids},
                              return_dict=False)
        return out_tuple
added_cond_kwargs["time_ids"])[0] return UNet2DConditionOutput(sample=sample) # For saving compiler artifacts COMPILER_WORKDIR_ROOT = 'sdxl_base_and_refiner_compile_dir_1024' # Model IDs for SD XL version pipeline base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" refiner_model_id = "stabilityai/stable-diffusion-xl-refiner-1.0" # All components we compile in this script: # 1. unet (base, in fp32) # 2. unet (refiner, in fp32) # 3. vae.decoder (base & refiner) # 4. vae.post_quant_conv (base & refiner) # --- Compile UNet in fp32 (base) and save --- pipe_base = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True) # Replace original cross-attention module with custom cross-attention module for better performance Attention.get_attention_scores = get_attention_scores_neuron # Apply double wrapper to deal with custom return type pipe_base.unet = NeuronUNet(UNetWrap(pipe_base.unet)) # Only keep the model being compiled in RAM to minimze memory pressure unet = copy.deepcopy(pipe_base.unet.unetwrap) del pipe_base # Compile unet - fp32 (note these tensors are cast to fp32 in UNetWrap) sample_1b = torch.randn([1, 4, 128, 128]) timestep_1b = torch.tensor(999).float().expand((1,)) encoder_hidden_states_1b = torch.randn([1, 77, 2048]) added_cond_kwargs_1b = {"text_embeds": torch.randn([1, 1280]), "time_ids": torch.randn([1, 6])} example_inputs = (sample_1b, timestep_1b, encoder_hidden_states_1b, added_cond_kwargs_1b["text_embeds"], added_cond_kwargs_1b["time_ids"],) unet_neuron = torch_neuronx.trace( unet, example_inputs, compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet_base'), compiler_args=["--model-type=unet-inference"] ) # Enable asynchronous and lazy loading to speed up model load torch_neuronx.async_load(unet_neuron) torch_neuronx.lazy_load(unet_neuron) # save compiled unet unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet_base/model.pt') torch.jit.save(unet_neuron, unet_filename) # delete unused objects del unet del unet_neuron # --- Compile UNet in fp32 (refiner) and save --- pipe_refiner = DiffusionPipeline.from_pretrained(refiner_model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True) # Replace original cross-attention module with custom cross-attention module for better performance Attention.get_attention_scores = get_attention_scores_neuron # Apply double wrapper to deal with custom return type pipe_refiner.unet = NeuronUNet(UNetWrap(pipe_refiner.unet)) # Only keep the model being compiled in RAM to minimze memory pressure unet = copy.deepcopy(pipe_refiner.unet.unetwrap) del pipe_refiner # Compile unet - fp32 - some input shapes are different from base sample_1b = torch.randn([1, 4, 128, 128]) timestep_1b = torch.tensor(999).float().expand((1,)) encoder_hidden_states_1b = torch.randn([1, 77, 1280]) added_cond_kwargs_1b = {"text_embeds": torch.randn([1, 1280]), "time_ids": torch.randn([1, 5])} example_inputs = (sample_1b, timestep_1b, encoder_hidden_states_1b, added_cond_kwargs_1b["text_embeds"], added_cond_kwargs_1b["time_ids"],) unet_neuron = torch_neuronx.trace( unet, example_inputs, compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet_refiner'), compiler_args=["--model-type=unet-inference"] ) # Enable asynchronous and lazy loading to speed up model load torch_neuronx.async_load(unet_neuron) torch_neuronx.lazy_load(unet_neuron) # save compiled unet unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet_refiner/model.pt') torch.jit.save(unet_neuron, unet_filename) # delete unused objects del unet del unet_neuron # --- 
# --- Compile VAE decoder and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Compile vae decoder
decoder_in = torch.randn([1, 4, 128, 128], dtype=DTYPE)
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)

# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

# delete unused objects
del decoder
del decoder_neuron

# --- Compile VAE post_quant_conv and save ---

# Only keep the model being compiled in RAM to minimize memory pressure
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe

# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 128, 128], dtype=DTYPE)
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)

# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)

# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)

# delete unused objects
del post_quant_conv
del post_quant_conv_neuron


================================================
FILE: archive/src/benchmark/pytorch/unet_benchmark.py
================================================
import torch
import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_name = "UNet"
batch_sizes = [1, 4]
n_models = [1, 2]
workers_per_model = [1, 2]  # optimized for latency or throughput


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    inputs = [get_batch(batch_size) for batch_size in batch_sizes]
    filename = f"{model_name}.json"

    # Benchmark
    print("Benchmarking {}".format(filename))
    reports = npf.torch.benchmark(filename, inputs, n_models=n_models, workers_per_model=workers_per_model)

    # View and save results
    print("======== {} ========".format(filename))
    npf.print_reports(reports)
    npf.write_csv(reports)
    npf.write_json(reports)


================================================
FILE: archive/src/benchmark/pytorch/unet_compile.py
================================================
import torch
import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_name = "UNet"
batch_sizes = [1, 4]
pipeline_sizes = [1]


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    # UNet implementation from https://github.com/milesial/Pytorch-UNet
    # load the model
    model = torch.hub.load('milesial/Pytorch-UNet', 'unet_carvana', pretrained=False)
    # load the weights
    state_dict = torch.hub.load_state_dict_from_url('https://github.com/milesial/Pytorch-UNet/releases/download/v3.0/unet_carvana_scale0.5_epoch2.pth', map_location="cpu")
    model.load_state_dict(state_dict)

    inputs = [get_batch(batch_size) for batch_size in batch_sizes]
    filename = f"{model_name}.json"

    # Compile
    print("Compiling {}".format(filename))
    npf.torch.compile(
        model,
        inputs,
        batch_sizes=batch_sizes,
        pipeline_sizes=pipeline_sizes,
        filename=filename,
        model_name=model_name,
    )


================================================
FILE: archive/src/benchmark/pytorch/vgg_benchmark.py
================================================
import torch
import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_names = ["vgg11", "vgg16"]
batch_sizes = [1, 8, 64]
n_models = [1, 2]
workers_per_model = [1, 2]  # optimized for latency or throughput


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    for model_name in model_names:
        inputs = [get_batch(batch_size) for batch_size in batch_sizes]
        filename = f"{model_name}.json"

        # Benchmark
        print("Benchmarking {}".format(filename))
        reports = npf.torch.benchmark(filename, inputs, n_models=n_models, workers_per_model=workers_per_model)

        # View and save results
        print("======== {} ========".format(filename))
        npf.print_reports(reports)
        npf.write_csv(reports)
        npf.write_json(reports)


================================================
FILE: archive/src/benchmark/pytorch/vgg_compile.py
================================================
import torch
import torchvision
import neuronperf as npf
import neuronperf.torch

# Add to these lists or change as needed
model_names = ["vgg11", "vgg16"]
batch_sizes = [1, 8, 64]
pipeline_sizes = [1]


def get_batch(batch_size):
    return torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)


if __name__ == "__main__":
    for model_name in model_names:
        model = getattr(torchvision.models, model_name)(pretrained=True)
        inputs = [get_batch(batch_size) for batch_size in batch_sizes]
        filename = f"{model_name}.json"

        # Compile
        print("Compiling {}".format(filename))
        npf.torch.compile(
            model,
            inputs,
            batch_sizes=batch_sizes,
            pipeline_sizes=pipeline_sizes,
            filename=filename,
            model_name=model_name,
        )


================================================
FILE: archive/tensorboard/getting-started-tensorboard-neuron-plugin.rst
================================================
.. _neuron-plugin-tensorboard:

.. meta::
   :noindex:
   :nofollow:
   :description: This page for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.
   :date-modified: 12-02-2025

Neuron Plugin for TensorBoard (Inf1)
====================================

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

This guide is for developers who want to better understand how their model is executed using the Neuron SDK through TensorBoard.

The Neuron plugin for TensorBoard provides metrics on the performance of machine learning tasks accelerated using the Neuron SDK. It is compatible with TensorBoard versions 1.15 and higher. It provides visualizations and profiling results for graphs executed on NeuronCores.

.. note::

   The following information is compatible with Neuron SDK for Inf1. For a walkthrough on the latest version, please check out the guide :ref:`neuronx-plugin-tensorboard`.

.. note::

   Graph visualization is currently only supported for TensorFlow-Neuron. Support for MXNet-Neuron and PyTorch-Neuron visualization will be added in a future release.

Compile the neural network
--------------------------

3. Refer to the following guides on how to compile a graph using the Neuron SDK.
- TensorFlow-Neuron - :ref:`/src/examples/tensorflow/tensorflow_resnet50/resnet50.ipynb`
- PyTorch-Neuron - "Compile model for Neuron" in `PyTorch-Neuron Resnet50 Tutorial`_
- MXNet-Neuron - :ref:`/src/examples/mxnet/resnet50/resnet50.ipynb`

Enable profiling
----------------

In this step, we enable Neuron profile data collection and collect results from executing an inference.

4.1. To start profiling the neural network and collect inference traces, create a directory where profile data will be dumped and set the ``NEURON_PROFILE`` environment variable. In this example, we will assume this directory is ``$HOME/profile``.

.. code:: bash

   mkdir -p $HOME/profile
   export NEURON_PROFILE=$HOME/profile

4.2. Ensure Neuron Tools are executable by setting the ``PATH`` environment variable.

.. code:: bash

   export PATH=/opt/aws/neuron/bin:$PATH

4.3. Execute inference!

.. note::

   Please run the inference script outside of Jupyter notebook. Profiling in Jupyter notebook is not supported at this time.

.. note::

   Please ensure the inference script executes only one inference, as profiling results are currently only supported for a single inference.

For more info on how to execute inference, refer to the following guides:

- TensorFlow-Neuron - :ref:`/src/examples/tensorflow/tensorflow_resnet50/resnet50.ipynb`
- PyTorch-Neuron - "Run inference on Single Core" in :ref:`/src/examples/pytorch/resnet50.ipynb`
- MXNet-Neuron - :ref:`/src/examples/mxnet/resnet50/resnet50.ipynb`

4.4. Check if profiling results were successfully saved. In the directory pointed to by the ``NEURON_PROFILE`` environment variable set in Step 4.1, there should be at least two files, one with the ``.neff`` extension and one with the ``.ntff`` extension. For TensorFlow-Neuron users, the graph file (``.pb``) will also be in this directory.

.. code:: bash

   ls $NEURON_PROFILE

Launch TensorBoard
------------------

In this step, we will process the Neuron profile data and launch TensorBoard.

5.1. Install the Neuron plugin for TensorBoard.

.. include:: /setup/install-templates/inf1/tensorboard-plugin-neuron-pip-install.rst

5.2. After collecting the raw profile data, we need to post-process it to create the log files used by the Neuron plugin. This can be done when launching TensorBoard by passing an extra flag ``--run_neuron_profiler``. Using this flag will create the directory specified by ``--logdir`` and populate it with Neuron plugin data. Please note that the ``NEURON_PROFILE`` environment variable set in Step 4.1 must still point to the same directory as before.

.. code:: bash

   tensorboard --logdir results --run_neuron_profiler

.. note::

   If using TensorBoard >= 2.5, please use the ``--load_fast=false`` option when launching: ``tensorboard --logdir results --run_neuron_profiler --load_fast=false``

5.3. After you see the following message, TensorBoard is ready to use. By default, TensorBoard will be launched at ``localhost:6006`` on the Deployment Instance.

::

   ...
   Running neuron-profile
   Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
   TensorBoard 2.4.1 at http://localhost:6006/ (Press CTRL+C to quit)
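As the startup message itself suggests, an alternative to the SSH port forwarding used in the next section is to bind TensorBoard to all interfaces; a minimal sketch, assuming the instance's security group permits inbound traffic on the port:

.. code:: bash

   tensorboard --logdir results --run_neuron_profiler --bind_all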
View results in TensorBoard
---------------------------

In this step, we will view the Neuron plugin for TensorBoard from a browser on your local development machine.

6.1. Connect to the Deployment Instance while enabling port forwarding. In this example, we assume TensorBoard has been launched using the default address ``localhost:6006`` on the Deployment Instance.

.. code:: bash

   # if Ubuntu-based AMI
   ssh -i ubuntu@ -L 6006:localhost:6006

   # if AL2-based AMI
   ssh -i ec2-user@ -L 6006:localhost:6006

6.2. In a browser, visit |tensorboard_address|.

6.3. In the top navigation bar, switch from ``Graphs`` to ``Neuron``. If it does not show up, please wait a while and refresh the page while the plugin loads. If the issue persists, check the ``Inactive`` dropdown list on the right and check for ``Neuron``.

|image1|

6.4. If TensorBoard failed to find the generated logs, you will see the following message:

|image10|

In this case, please check the console output on the Deployment Instance where TensorBoard was launched for any warnings or error messages, and make sure the version of the ``aws-neuron-tools`` package is compatible.

.. _tensorboard-plugin-visualize-graph:

Visualize graphs executed on Neuron
-----------------------------------

.. _tensorboard-plugin-graph-device:

Show how the graph was partitioned to run on NeuronCores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To view how the graph was partitioned to run on NeuronCores, select "Device" under "Graph Color Schemes" in the left navigation bar.

|image2|

Each operator will be colored according to the device used. In this example, light blue indicates an operator was executed on CPU, and orange indicates the operator was executed on NeuronCores. Operators that are white may have been optimized by the Neuron compiler and fused into another operation.

.. _tensorboard-plugin-graph-time:

Inspect which operators consume the most time
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also view how long each operator took by changing to the "Compute time" color scheme.

|image3|

This view will show the time taken by each layer, colored according to how much relative time the layer took to compute. A lighter shade of red means that a relatively small portion of compute time was spent in this layer, while a darker red shows that more compute time was used.

.. _tensorboard-plugin-graph-supported-ops:

Check Neuron-supported operators for each framework
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The "Compatibility" color scheme allows you to better understand which operators are currently supported by the Neuron compiler - green for compatible ops, red for incompatible ops, and yellow for subgraphs that contain both compatible and incompatible ops.

|image4|

.. _tensorboard-plugin-graph-filter-device:

Filter view by device
^^^^^^^^^^^^^^^^^^^^^

Additionally, you can choose to filter by CPU and NeuronCores, which will only color ops that match the selected device(s).

|image5|

Expand/collapse subgraphs and view operator details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each rectangular node in the graph represents a subgraph that can be expanded or collapsed by clicking on the name. Operators will be represented by ellipses, and can be clicked to reveal more information on that operator, such as inputs and execution device.

|image11|

The ``Expand All`` and ``Collapse All`` buttons can be used to expand or collapse every subgraph. When using these features, the positioning of the graph may change when redrawing the new graph. Try using the ``Reset Position`` button and zoom out by scrolling if the graph appears to be missing.

.. _tensorboard-plugin-view-profile:

Viewing the Neuron profile data
-------------------------------

On the right side of the Neuron plugin, information on the profiled inference will be displayed.
.. _tensorboard-plugin-profile-summary:

See performance summary
^^^^^^^^^^^^^^^^^^^^^^^

First is the "Neuron Performance Summary," which gives a quick overview of how Neuron executed the graph, including information on the number of NeuronCores and both on-NeuronCore time and on-CPU time.

|image6|

.. _tensorboard-plugin-profile-nc:

Get a breakdown of time spent per NeuronCore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, the "Neuron Execution" will give more details on how a graph was partitioned for Neuron. Each entry in the table will show the order it was executed in, what type of device was used, the compute time (in microseconds), and the percentage of total time spent. To dive deeper into subgraphs, you can check the "Show Details" box to display the breakdown per NeuronCore.

|image7|

.. _tensorboard-plugin-profile-op:

Get a breakdown of time spent per operator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The "Op Time Table" section shows the cycle count per operator, much like the "Compute time" coloring for graph visualization. This table can be sorted by clicking the column names, and searched using the provided text box in the top right corner. Due to Neuron compiler optimizations, some of the compute may not be associated with any specific operator and will be categorized as ``unknown``. Additionally, time spent moving data to and from NeuronCores will fall under ``(ND_ENGINE_LOAD)``.

|image8|

.. |image1| image:: /images/tb-plugin-img1.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image2| image:: /images/tb-plugin-img2.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image3| image:: /images/tb-plugin-img3.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image4| image:: /images/tb-plugin-img4.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image5| image:: /images/tb-plugin-img5.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image6| image:: /images/tb-plugin-img6.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image7| image:: /images/tb-plugin-img7.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image8| image:: /images/tb-plugin-img8.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image9| image:: /images/tb-plugin-img9.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image10| image:: /images/tb-plugin-img10.png
   :height: 2914
   :width: 5344
   :scale: 10%

.. |image11| image:: /images/tb-plugin-img11.png
   :height: 2826
   :width: 5341
   :scale: 10%

.. _PyTorch-Neuron Resnet50 Tutorial: ../../src/examples/pytorch/resnet50.ipynb

.. |tensorboard_address| raw:: html

   localhost:6006


================================================
FILE: archive/tensorflow/index.rst
================================================
.. _tensorflow-neuron-main:
.. _tensorflow-neuron:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow Neuron
=================

.. warning::

   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

TensorFlow Neuron unlocks high-performance and cost-effective deep learning acceleration on AWS Trainium-based and Inferentia-based Amazon EC2 instances. TensorFlow Neuron enables native TensorFlow models to be accelerated on Neuron devices, so you can use your existing framework application and get started easily with minimal code changes.

.. toctree::
   :maxdepth: 1
   :hidden:

   /archive/tensorflow/tensorflow-setup

.. toctree::
   :maxdepth: 2
   :hidden:

   Inference (Inf2 & Trn1) 
   Inference (Inf1) 
.. card:: Tensorflow NeuronX for Inference on ``Inf2`` & ``Trn1`` / ``Trn1n``
   :link: inference-tensorflow-neuronx
   :link-type: ref
   :class-body: sphinx-design-class-title-small

.. card:: Tensorflow Neuron for Inference on ``Inf1``
   :link: inference-tensorflow-neuron
   :link-type: ref
   :class-body: sphinx-design-class-title-small


================================================
FILE: archive/tensorflow/setup-legacy-inf1-tensorflow.rst
================================================
.. meta::
   :description: Legacy TensorFlow installation guide for AWS Inferentia 1 (Inf1) instances
   :keywords: tensorflow, neuron, inf1, legacy, installation, tensorflow-neuron
   :framework: tensorflow
   :instance-types: inf1
   :status: legacy
   :content-type: legacy-guide
   :date-modified: 2026-03-30

TensorFlow on Inf1 (legacy)
===========================

.. warning::

   **Legacy hardware**: Inf1 instances use NeuronCore v1 with TensorFlow 2.x (``tensorflow-neuron``). For new projects, use **Inf2, Trn1, Trn2, or Trn3** with PyTorch 2.9+ or JAX 0.7+. See :ref:`setup-guide-index` for current setup options.

.. note::

   TensorFlow support for Inf2 has reached end of support as of Neuron SDK 2.29. See :ref:`announce-eos-tensorflow-inf2` for details.

Setup instructions
------------------

For complete Inf1 TensorFlow setup instructions, see the original setup guides:

- :doc:`/archive/tensorflow/tensorflow-neuron/setup/tensorflow-update` - TensorFlow Neuron setup and updates
- :doc:`/archive/tensorflow/tensorflow-neuron-inference` - Inference on Inf1

The setup guides cover:

- Ubuntu 20, Ubuntu 22, and Amazon Linux 2 installation
- DLAMI-based installation
- Manual pip installation
- TensorFlow 2.10.1, 2.9.3, and 2.8.4 versions

Verification
------------

After installation, verify with:

.. code-block:: python

   import tensorflow as tf
   import tensorflow_neuron
   print(f"TensorFlow version: {tf.__version__}")

.. code-block:: bash

   neuron-ls

Next steps
----------

- :doc:`/archive/tensorflow/tensorflow-neuron-inference` - Inference tutorials for Inf1
- :ref:`setup-guide-index` - Current setup options (Inf2, Trn1, Trn2, Trn3)


================================================
FILE: archive/tensorflow/tensorflow-neuron/additional-examples.rst
================================================
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Additional Examples (``tensorflow-neuron``)
===========================================

.. warning::

   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1
   :hidden:

   AWS Neuron Samples GitHub Repository 

.. include:: /archive/tensorflow/tensorflow-neuron/additional-examples.txt


================================================
FILE: archive/tensorflow/tensorflow-neuron/additional-examples.txt
================================================
* `AWS Neuron Samples GitHub Repository `_


================================================
FILE: archive/tensorflow/tensorflow-neuron/api-auto-replication-api.rst
================================================
.. _tensorflow-ref-auto-replication-python-api:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 2.x (``tensorflow-neuron``) Auto Multicore Replication (Beta)
========================================================================
.. warning::

   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The Neuron auto multicore replication Python API enables modifying TensorFlow 2.x traced models so that they can be automatically replicated across multiple cores. For Tensorflow-Serving models and TensorFlow 1.x models, see :ref:`tensorflow-ref-auto-replication-cli-api`.

.. contents:: Table of contents
   :local:
   :depth: 1

TensorFlow 2.x (``tensorflow-neuron TF2.x``) Auto Multicore Replication Python API (Beta)
------------------------------------------------------------------------------------------

Method
^^^^^^

``tensorflow.neuron.auto_multicore``

Description
^^^^^^^^^^^

Converts an existing AWS-Neuron-optimized ``keras.Model`` and returns an auto-replication-tagged, AWS-Multicore-Neuron-optimized ``keras.Model`` that can execute on AWS Machine Learning Accelerators. Like the traced model, the returned ``keras.Model`` will support inference only. Attributes or variables held by the original function or ``keras.Model`` will be dropped.

The auto model replication feature in TensorFlow-Neuron enables you to create a model once, and model-parallel replication then happens automatically. The desired number of cores can be less than the total available NeuronCores on an Inf1 instance but not less than 1. This reduces framework memory usage, as you are not loading the same model multiple times manually. Calls to the returned model will execute the call on each core in a round-robin fashion.

The returned ``keras.Model`` can be exported as SavedModel and served using TensorFlow Serving. Please see the TensorFlow Serving documentation for more information about exporting to saved model and serving using TensorFlow Serving.

Note that the automatic replication will only work on models compiled with pipeline size 1: via ``--neuroncore-pipeline-cores=1``. If auto replication is not enabled, the model will default to replicate on up to 4 cores.

See :ref:`neuron-compiler-cli-reference` for more information about compiler options.

Arguments
^^^^^^^^^

- **func:** The ``keras.Model`` or function to be traced.
- **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict``. The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported.
- **num_cores:** The desired number of cores across which the model will be automatically replicated.

Returns
^^^^^^^

- An AWS-Multicore-Neuron-optimized ``keras.Model``.

Example Python API Usage for TF2.x traced models:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python

   input0 = tf.keras.layers.Input(3)
   dense0 = tf.keras.layers.Dense(3)(input0)
   inputs = [input0]
   outputs = [dense0]
   model = tf.keras.Model(inputs=inputs, outputs=outputs)
   input0_tensor = tf.random.uniform([1, 3])
   model_neuron = tfn.trace(model, input0_tensor)

   num_cores = 4
   multicore_model = tfn.auto_multicore(model_neuron, input0_tensor, num_cores=num_cores)
   multicore_model(input0_tensor)

Example Python API Usage for TF2.x saved models:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

   from tensorflow.python import saved_model

   input0_tensor = tf.random.uniform([1, 3])
   num_cores = 4
   reload_model = saved_model.load(model_dir)
   multicore_model = tfn.auto_multicore(reload_model, input0_tensor, num_cores=num_cores)

.. _tensorflow-ref-auto-replication-cli-api:

TensorFlow Neuron 2.x (``tensorflow-neuron``) Auto Multicore Replication CLI (Beta)
------------------------------------------------------------------------------------

The Neuron auto multicore replication CLI enables modifying TensorFlow 1.x and TensorFlow 2.x traced saved models so that they can be automatically replicated across multiple cores. By performing this call on TensorFlow saved models, we can support both TensorFlow-Serving and TensorFlow 1.x without significant modifications to the code. Note that the Python API does not support TensorFlow 1.x.

Method
^^^^^^

``tf-neuron-auto-multicore MODEL_DIR --num_cores NUM_CORES --new_model_dir NEW_MODEL_DIR``

Arguments
^^^^^^^^^

- **MODEL_DIR:** The directory of a saved AWS-Neuron-optimized ``keras.Model``.
- **NUM_CORES:** The desired number of cores across which the model will be automatically replicated.
- **NEW_MODEL_DIR:** The directory where the AWS-Multicore-Neuron-optimized ``keras.Model`` will be saved.
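For example, a sketch of a CLI invocation (the paths are hypothetical placeholders):

.. code:: bash

   tf-neuron-auto-multicore ./neuron_model --num_cores 4 --new_model_dir ./neuron_model_multicore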
================================================
FILE: archive/tensorflow/tensorflow-neuron/api-compilation-python-api.rst
================================================
.. _tensorflow-ref-neuron-compile-api:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 1.x (``tensorflow-neuron``) Compilation API
======================================================

.. warning::

   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The Neuron compilation API for TensorFlow 1.x enables compilation of a saved model to an Inferentia target.

Method
------

``tensorflow.neuron.saved_model.compile``

Description
-----------

Within the graph or subgraph, the compile method selects and sends Neuron-supported operations to the Neuron compiler for compilation, and saves the compiled artifacts in the graph. Uncompilable operations are kept as original operations for framework execution.

The compiled graph can be exported to saved model and served using TensorFlow Serving. Please see the TensorFlow Serving documentation for more information about exporting to saved model and serving using TensorFlow Serving.

Options can be passed to the Neuron compiler via the compile function. For example, the ``--neuroncore-pipeline-cores`` option directs the Neuron compiler to compile each subgraph to fit in the specified number of NeuronCores. This number can be less than the total available NeuronCores on an Inf1 instance. See :ref:`neuron-compiler-cli-reference` for more information about compiler options.

Arguments
---------

- **model_dir:** The path of the original ``SavedModel``.
- **new_model_dir:** The path to which the Neuron-optimized ``SavedModel`` will be stored.
- **batch_size:** (Optional) Positive integer representing batch size used in inference. The default value is 1.
- **model_shape_feed_dict:** (Optional) Dictionary {str: list} used for inferring tensor shapes. Keys should match model input names. Values are lists of positive integers representing model input tensor shapes.
- **model_feed_dict:** (Optional) Dictionary {str: numpy.array} used for inference. Useful for inferring tensor shapes. Keys should match model input names. Values are numpy arrays that can be fed as inputs to the ``SavedModel``.
- **tags:** (Optional) Iterable of strings to identify the required ``MetaGraphDef``. These should correspond to the tags used when saving the variables using the ``SavedModel`` ``save()`` API. Default is to use the first ``tag_set`` available in the ``SavedModel``.
- **signature_def_key:** (Optional) String specifying the ``signature_def`` to use. Default is to use 'serving_default' or the first ``signature_def`` corresponding to ``tags``.
- **minimum_segment_size:** (Optional) Integer indicating the minimum number of operations in a NeuronOp.
- **no_fuse_ops:** (Optional) None or iterable of strings (unordered) representing names of operations that are forcibly placed on CPU.
- **compiler_args:** (Optional) List of strings representing neuron-cc compiler arguments. Note that these arguments apply to all subgraphs generated by whitelist partitioning. For example, use ``compiler_args=['--neuroncore-pipeline-cores', '4']`` to set the number of NeuronCores per subgraph to 4. See :ref:`neuron-compiler-cli-reference` for more information about compiler options.
- **compiler_workdir:** (Optional) String representing the work directory of the neuron-cc compiler.

Returns
-------

- Dictionary with operator counts before/after optimization.
- Operator count statistics are displayed to show the original count, the post-optimization count, and the number placed on the Neuron runtime. For example:

::

   INFO:tensorflow:Number of operations in TensorFlow session: 3978
   INFO:tensorflow:Number of operations after tf.neuron optimizations: 555
   INFO:tensorflow:Number of operations placed on Neuron runtime: 554

Example Usage
-------------

.. code:: python

   import shutil
   import tensorflow.neuron as tfn

   saved_model_path = ""
   compiled_saved_model_path = ""
   shutil.rmtree(compiled_saved_model_path, ignore_errors=True)
   tfn.saved_model.compile(saved_model_path, compiled_saved_model_path)
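A variant that also passes compiler options through the documented ``compiler_args`` argument (an illustrative sketch; the paths above are left elided as in the original):

.. code:: python

   tfn.saved_model.compile(
       saved_model_path, compiled_saved_model_path,
       batch_size=1,
       compiler_args=['--neuroncore-pipeline-cores', '4'],
   )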
================================================
FILE: archive/tensorflow/tensorflow-neuron/api-reference-guide.txt
================================================

* :ref:`tensorflow-ref-neuron-tracing-api`
* :ref:`tensorflow-ref-neuron-analyze_model-api`
* :ref:`tensorflow-ref-auto-replication-python-api`

================================================
FILE: archive/tensorflow/tensorflow-neuron/api-tfn-analyze-model-api.rst
================================================

.. _tensorflow-ref-neuron-analyze_model-api:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 2.x (``tensorflow-neuron``) analyze_model API
=========================================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

Method
------

``tensorflow.neuron.analyze_model``

Description
-----------

Analyzes a ``keras.Model`` or a Python callable that can be decorated by ``tf.function`` for its compatibility with Neuron. It displays supported vs. unsupported operators in the model as well as percentages and counts of each operator, and returns a dictionary with operator statistics.

Arguments
---------

- **func:** The ``keras.Model`` or function to be analyzed.
- **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict``. The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported.

Returns
-------

- A results ``dict`` with these keys: ``'percent_supported'``, ``'supported_count'``, ``'total_count'``, ``'supported_operators'``, ``'unsupported_operators'``, ``'operators'``, ``'operator_count'``.

Example Usage
-------------

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   input0 = tf.keras.layers.Input(3)
   dense0 = tf.keras.layers.Dense(3)(input0)
   model = tf.keras.Model(inputs=[input0], outputs=[dense0])
   example_inputs = tf.random.uniform([1, 3])
   results = tfn.analyze_model(model, example_inputs)
   print(results)

   # expected output
   '''
   BiasAdd
   MatMul
   100.00% of all operations (2 of 2) are supported
   {'percent_supported': 100.0, 'supported_count': 2, 'total_count': 2,
    'supported_operators': {'BiasAdd', 'MatMul'}, 'unsupported_operators': [],
    'operators': ['BiasAdd', 'MatMul'], 'operator_count': {'MatMul': 1, 'BiasAdd': 1}}
   '''
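Because the returned statistics are plain Python data, they can drive a go/no-go decision before tracing. The sketch below continues from the example above; the 50% threshold is an arbitrary illustration, not a recommended value.

.. code:: python

   # Continue from the example above: trace only if enough of the model
   # is supported (the 50% threshold is an arbitrary illustration).
   if results['percent_supported'] >= 50.0:
       model_neuron = tfn.trace(model, example_inputs)
   else:
       print('Unsupported operators:', results['unsupported_operators'])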
================================================
FILE: archive/tensorflow/tensorflow-neuron/api-tracing-python-api.rst
================================================

.. _tensorflow-ref-neuron-tracing-api:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 2.x (``tensorflow-neuron``) Tracing API
===================================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The Neuron tracing API enables tracing TensorFlow 2.x models for deployment on AWS Machine Learning Accelerators.

Method
------

``tensorflow.neuron.trace``

Description
-----------

Traces a ``keras.Model`` or a Python callable that can be decorated by ``tf.function``, and returns an AWS-Neuron-optimized ``keras.Model`` that can execute on AWS Machine Learning Accelerators. Tracing is ideal for a ``keras.Model`` that accepts a list of ``tf.Tensor`` objects and returns a list of ``tf.Tensor`` objects. It is expected that users will provide example inputs, and the ``trace`` function will execute ``func`` symbolically and convert it to a ``keras.Model``.

The returned ``keras.Model`` supports inference only. Attributes or variables held by the original function or ``keras.Model`` will be dropped.

The returned ``keras.Model`` can be exported as a SavedModel and served using TensorFlow Serving. Please see the TensorFlow Serving documentation for more information about exporting to SavedModel and serving using TensorFlow Serving.

The returned ``keras.Model`` has an ``.on_neuron_ratio`` attribute which shows the percentage of ops mapped to Neuron hardware. This calculation ignores PlaceholderOp, IdentityOp, ReadVariableOp, and NoOp.

Options can be passed to the Neuron compiler via the environment variable ``NEURON_CC_FLAGS``. For example, the syntax ``env NEURON_CC_FLAGS="--neuroncore-pipeline-cores=4"`` directs the Neuron compiler to compile each subgraph to fit in the specified number of NeuronCores. This number can be less than the total available NeuronCores on an Inf1 instance. See :ref:`neuron-compiler-cli-reference` for more information about compiler options.

Arguments
---------

- **func:** The ``keras.Model`` or function to be traced.
- **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict`` (see the sketch after this list). The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported.
- **subgraph_builder_function:** (Optional) A callable with signature ``subgraph_builder_function(node : NodeDef) -> bool`` (``NodeDef`` is defined in tensorflow/core/framework/node_def.proto) that is used as a call-back function to determine which part of the TensorFlow GraphDef given by tracing ``func`` will be placed on Machine Learning Accelerators. If ``subgraph_builder_function`` is not provided, then ``trace`` will automatically place operations on Machine Learning Accelerators or on CPU to maximize execution efficiency. If it is provided, and ``subgraph_builder_function(node)`` returns ``True``, and placing ``node`` on Machine Learning Accelerators will not cause deadlocks during execution, then ``trace`` will place ``node`` on Machine Learning Accelerators. If ``subgraph_builder_function(node)`` returns ``False``, then ``trace`` will place ``node`` on CPU.
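As a minimal sketch of the ``dict`` calling convention described above: the two-input function here is hypothetical, and it is assumed that the traced callable keeps the same keyword-argument signature.

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   # Hypothetical two-input callable; with dict example_inputs,
   # trace invokes it as func(**example_inputs).
   def func(x, y):
       return tf.matmul(x, y) + 1.0

   example_inputs = {
       'x': tf.random.uniform([1, 3]),
       'y': tf.random.uniform([3, 3]),
   }
   func_neuron = tfn.trace(func, example_inputs)
   # Assumes the traced callable keeps the keyword calling convention.
   output = func_neuron(**example_inputs)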
Special Flags
-------------

These are flags that are passed directly to the Neuron tracing API (rather than the Neuron compiler). The flags are still passed via the environment variable ``NEURON_CC_FLAGS``; a sketch of setting them from Python follows this list.

- **workdir:** Example usage: ``NEURON_CC_FLAGS='--workdir ./artifacts'`` will create a folder named ``artifacts`` in the current directory and save artifacts that can be used for debugging.
- **dynamic-batch-size:** Example usage: ``NEURON_CC_FLAGS='--dynamic-batch-size'``. A flag that allows Neuron graphs to consume variable-sized batches of data. Dynamic sizing is restricted to the 0th dimension of a tensor.
- **extract-weights (Beta):** Example usage: ``NEURON_CC_FLAGS='--extract-weights inf1.2xlarge'`` will reduce the compiled model's protobuf size by taking the weights out of the protobuf. This is useful for compiling large models that would otherwise exceed the 2 GB protobuf size limit. This feature is in beta; model performance is not guaranteed, and the flag does not work in combination with ``--neuroncore-pipeline-cores``, ``--dynamic-batch-size``, models with multiple NEFFs, or models that are 4 GB or greater. The flag compiles models for different Neuron instances depending on the instance type passed, and supports all Inf1 instance types.
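A minimal sketch of setting these flags from within a Python script instead of on the command line. It assumes that ``NEURON_CC_FLAGS`` is read from the environment at the time ``trace`` is called, and that multiple flags can be combined in one space-separated string.

.. code:: python

   import os
   import tensorflow as tf
   import tensorflow.neuron as tfn

   # Set tracing flags before calling trace (assumes the environment
   # variable is read when trace is invoked).
   os.environ['NEURON_CC_FLAGS'] = '--dynamic-batch-size --workdir ./artifacts'

   input0 = tf.keras.layers.Input(3)
   dense0 = tf.keras.layers.Dense(3)(input0)
   model = tf.keras.Model(inputs=[input0], outputs=[dense0])
   model_neuron = tfn.trace(model, tf.random.uniform([1, 3]))

   # With --dynamic-batch-size, the 0th (batch) dimension may vary at inference.
   model_neuron(tf.random.uniform([8, 3]))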
Returns
-------

- An AWS-Neuron-optimized ``keras.Model``.

Example Usage
-------------

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   input0 = tf.keras.layers.Input(3)
   dense0 = tf.keras.layers.Dense(3)(input0)
   model = tf.keras.Model(inputs=[input0], outputs=[dense0])
   example_inputs = tf.random.uniform([1, 3])

   # trace
   model_neuron = tfn.trace(model, example_inputs)

   # check to see how much of the model was compiled successfully
   print(model_neuron.on_neuron_ratio)

   model_dir = './model_neuron'
   model_neuron.save(model_dir)
   model_neuron_reloaded = tf.keras.models.load_model(model_dir)

Example Usage with Manual Device Placement Using ``subgraph_builder_function``
------------------------------------------------------------------------------

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   input0 = tf.keras.layers.Input(3)
   dense0 = tf.keras.layers.Dense(3)(input0)
   reshape0 = tf.keras.layers.Reshape([1, 3])(dense0)
   output0 = tf.keras.layers.Dense(2)(reshape0)
   model = tf.keras.Model(inputs=[input0], outputs=[output0])
   example_inputs = tf.random.uniform([1, 3])

   def subgraph_builder_function(node):
       return node.op == 'MatMul'

   model_neuron = tfn.trace(
       model,
       example_inputs,
       subgraph_builder_function=subgraph_builder_function,
   )

.. important::

   Although the old API ``tensorflow.neuron.saved_model.compile`` is still available under tensorflow-neuron 2.x, it supports only the limited capabilities of ``tensorflow.neuron.trace`` and will be deprecated in future releases.

================================================
FILE: archive/tensorflow/tensorflow-neuron/dlc-then-ec2-devflow.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/dlc-then-ec2-devflow.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/dlc-then-ecs-devflow.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/dlc-then-ecs-devflow.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/dlc-then-eks-devflow.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/dlc-then-eks-devflow.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/ec2-then-ec2-devflow.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/ec2-then-ec2-devflow.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/misc-tensorflow-neuron.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Misc (``tensorflow-neuron``)
============================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1
   :hidden:

   /release-notes/archive/tensorflow/tensorflow-neuron/tensorflow-neuron-v2
   /archive/tensorflow/tensorflow-neuron/tensorflow2-accelerated-ops

.. include:: /archive/tensorflow/tensorflow-neuron/misc-tensorflow-neuron.txt

================================================
FILE: archive/tensorflow/tensorflow-neuron/misc-tensorflow-neuron.txt
================================================

* :ref:`tensorflow-neuron-rn-v2`
* :ref:`tensorflow-ref-neuron-accelerated-ops`

================================================
FILE: archive/tensorflow/tensorflow-neuron/neo-then-hosting-devflow.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /devflows/inference/neo-then-hosting-devflow.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.14.2-tensorflow-install.rst
================================================

.. _install-neuron-1.14.2-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron (Neuron 1.14.2)
=========================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

..
contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2 ================================================ FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.0-tensorflow-install.rst ================================================ .. _install-neuron-1.15.0-tensorflow: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install TensorFlow Neuron (Neuron 1.15.0) ====================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.5.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: TensorFlow 2.4.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: TensorFlow 2.3.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: TensorFlow 2.2.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: TensorFlow 2.1.4 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.5.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: TensorFlow 2.4.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: TensorFlow 2.3.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: TensorFlow 2.2.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: TensorFlow 2.1.4 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.5.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: TensorFlow 2.4.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Ubuntu DLAMI .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.4.2 .. tab-item:: TensorFlow 2.3.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.3.3 .. tab-item:: TensorFlow 2.2.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.2.3 .. tab-item:: TensorFlow 2.1.4 .. tab-set:: .. tab-item:: Ubuntu AMI .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-2.1.4 .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=tensorflow-1.15.5 ================================================ FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.1-tensorflow-install.rst ================================================ .. _install-neuron-1.15.1-tensorflow: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install TensorFlow Neuron (Neuron 1.15.1) ========================================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. 
include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.5.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: TensorFlow 2.4.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2 .. tab-item:: TensorFlow 2.3.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3 .. tab-item:: Ubuntu DLAMI .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3 .. tab-item:: TensorFlow 2.2.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3 .. tab-item:: TensorFlow 2.1.4 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4 .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. 
         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1

   .. tab-item:: TensorFlow 2.4.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

   .. tab-item:: TensorFlow 2.3.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1

   .. tab-item:: TensorFlow 2.4.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.4.2

   .. tab-item:: TensorFlow 2.3.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.3.3

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=tensorflow-1.15.5
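.. note::

   The install steps in the tabs above are generated at documentation build time by
   the ``neuronsetuphelper.py`` helper script, parameterized by mode
   (``develop``, ``compile``, ``deploy``), AMI type, OS, and Neuron/framework
   version. As a minimal sketch (assuming a checkout of this repository, so the
   script and release manifest exist at the paths shown), the same instructions can
   be rendered locally:

   .. code-block:: bash

      # Render the develop-mode instructions for TensorFlow 1.15.5 on an Ubuntu
      # non-DLAMI instance, pinned to Neuron release 1.15.1. The generated shell
      # commands are printed to stdout (exact formatting may vary by script version).
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow \
          --mode=develop \
          --ami=non-dlami \
          --os=ubuntu \
          --neuron-version=1.15.1 \
          --framework-version=tensorflow-1.15.5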
================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.2-tensorflow-install.rst
================================================

.. _install-neuron-1.15.2-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron (Neuron 1.15.2)
=========================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: TensorFlow 2.4.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

   .. tab-item:: TensorFlow 2.3.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: TensorFlow 2.4.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

   .. tab-item:: TensorFlow 2.3.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2

   .. tab-item:: TensorFlow 2.4.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.4.2

   .. tab-item:: TensorFlow 2.3.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.3.3

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=tensorflow-1.15.5


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.16.3-tensorflow-install.rst
================================================

.. _install-neuron-1.16.3-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2
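.. note::

   Each tab in the sections below renders pip-based install commands from the
   Neuron release manifest. As a rough sketch only (the package pins shown here are
   illustrative and not taken from the rendered output), a develop-mode install on
   Inf1 typically resolves to commands of this shape:

   .. code-block:: bash

      # Create and activate an isolated Python environment.
      python3 -m venv aws_neuron_venv_tf
      source aws_neuron_venv_tf/bin/activate
      pip install -U pip

      # Install TensorFlow Neuron and the Neuron compiler from the AWS Neuron
      # pip repository (exact versions are resolved by the release manifest).
      pip install tensorflow-neuron "neuron-cc[tensorflow]" \
          --extra-index-url=https://pip.repos.neuron.amazonaws.com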
Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            ..
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=tensorflow-1.15.5 ================================================ FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.0-tensorflow-install.rst ================================================ .. _install-neuron-1.17.0-tensorflow: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install TensorFlow Neuron ========================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /setup/install-templates/inf1/note-setup-cntr.rst .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.5.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 .. tab-item:: TensorFlow 2.4.3 .. tab-set:: .. tab-item:: Ubuntu AMI .. 
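The install commands on this page are rendered at documentation build time by the ``neuronsetuphelper.py`` script against the release manifest. If you want to reproduce a single rendered block outside the docs build, a minimal sketch (assuming it is run from the repository root; the flags mirror the ``program-output`` directives below):

.. code-block:: python

   import subprocess

   # Render the install instructions for one configuration by calling the
   # same helper the docs build invokes via the program-output directive.
   cmd = [
       "python3", "src/helperscripts/neuronsetuphelper.py",
       "--file", "src/helperscripts/neuron-releases-manifest.json",
       "--install", "tensorflow",
       "--mode=develop", "--ami=non-dlami", "--os=ubuntu",
       "--neuron-version=1.17.0",
   ]
   print(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)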
Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst
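Once the compiler packages are installed, compilation itself happens in your Python session. For the TF 2.x packages above, the documented entry point is ``tfn.trace``; a minimal sketch (illustrative only, on an archived release; the model choice and output path are placeholders):

.. code-block:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   # Compile a Keras model ahead of time on the compute instance, then copy
   # the saved artifact to an Inf1 instance for deployment.
   model = tf.keras.applications.ResNet50(weights=None)
   example_input = tf.random.uniform([1, 224, 224, 3])
   model_neuron = tfn.trace(model, example_input)  # invokes the Neuron compiler
   model_neuron.save("resnet50_neuron")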
.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.0 --framework-version=tensorflow-1.15.5

================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.1-tensorflow-install.rst
================================================

.. _install-neuron-1.17.1-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2
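Because each of these archived pages pins every command to a single Neuron release, a quick consistency check can catch ``--neuron-version`` flags that drift from the release named in the page's anchor. A minimal sketch (a hypothetical maintenance helper, not part of this repository):

.. code-block:: python

   import pathlib
   import re
   import sys

   def check(path):
       """Compare every --neuron-version flag against the page anchor,
       e.g. '.. _install-neuron-1.17.1-tensorflow:' expects only 1.17.1."""
       text = pathlib.Path(path).read_text()
       m = re.search(r"_install-neuron-([\d.]+)-tensorflow:", text)
       expected = m.group(1) if m else None
       found = re.findall(r"--neuron-version=([\d.]+)", text)
       return expected, sorted({v for v in found if v != expected})

   if __name__ == "__main__":
       expected, mismatched = check(sys.argv[1])
       print(f"expected {expected}; mismatched: {mismatched or 'none'}")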
Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.1 --framework-version=tensorflow-1.15.5

================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.2-tensorflow-install.rst
================================================

.. _install-neuron-1.17.2-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2
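Each section below offers the same four OS/AMI variants; only the ``--os`` and ``--ami`` flags differ between tabs. To render the full matrix for one mode in one pass, a minimal sketch (run from the repository root; flag values taken from the directives below):

.. code-block:: python

   import subprocess

   # Render all four (os, ami) tab variants of the develop-mode instructions
   # for this release.
   for os_name in ("ubuntu", "amazonlinux"):
       for ami in ("non-dlami", "dlami"):
           subprocess.run(
               ["python3", "src/helperscripts/neuronsetuphelper.py",
                "--file", "src/helperscripts/neuron-releases-manifest.json",
                "--install", "tensorflow", "--mode=develop",
                f"--ami={ami}", f"--os={os_name}",
                "--neuron-version=1.17.2"],
               check=True)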
================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.2-tensorflow-install.rst
================================================

.. _install-neuron-1.17.2-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5
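.. note::

   The helper renders a package-install sequence for each tab. As a rough
   sketch of what develop mode produces (illustrative only -- the exact
   package pins come from ``neuron-releases-manifest.json``, and the
   virtual-environment name here is invented for the example):

   .. code-block:: bash

      # Illustrative sketch, not the authoritative output.
      python3 -m venv tensorflow_venv          # hypothetical env name
      source tensorflow_venv/bin/activate
      pip install -U pip
      # Develop mode installs the framework plus the Neuron compiler.
      pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
          "tensorflow-neuron==2.5.2.*" neuron-cc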
Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5
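.. note::

   The compile-mode tabs above and the deploy-mode tabs below differ only
   in the ``--mode`` flag passed to the helper; every other flag is
   identical. For example, for the default TensorFlow on an Ubuntu AMI:

   .. code-block:: bash

      # Compile on a compute (non-accelerator) instance:
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow --mode=compile --ami=non-dlami \
          --os=ubuntu --neuron-version=1.17.2

      # Deploy on an ML accelerator instance:
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow --mode=deploy --ami=non-dlami \
          --os=ubuntu --neuron-version=1.17.2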
Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.5.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: TensorFlow 2.4.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.4.3

   .. tab-item:: TensorFlow 2.3.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.3.4

   .. tab-item:: TensorFlow 2.2.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.2.3

   .. tab-item:: TensorFlow 2.1.4

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-2.1.4

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=tensorflow-1.15.5
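.. note::

   Deploy mode targets an instance that only runs already-compiled models,
   so the rendered commands omit the compiler. A rough sketch (illustrative
   only; the authoritative output comes from the manifest via the helper):

   .. code-block:: bash

      # Illustrative sketch, not the authoritative output.
      # Framework package only -- no neuron-cc on deployment instances.
      pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
          "tensorflow-neuron==2.5.2.*"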
================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.18.0-tensorflow-install.rst
================================================

.. _install-neuron-1.18.0-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5
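.. note::

   When a tab omits ``--framework-version``, the helper falls back to the
   newest TensorFlow packaged with that Neuron release (2.7.1 for Neuron
   1.18.0, as the first tab above suggests). To pin an older framework,
   pass the flag explicitly:

   .. code-block:: bash

      # Default: newest TensorFlow for Neuron 1.18.0.
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow --mode=develop --ami=non-dlami \
          --os=ubuntu --neuron-version=1.18.0

      # Pin TensorFlow 2.6.3 instead.
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow --mode=develop --ami=non-dlami \
          --os=ubuntu --neuron-version=1.18.0 \
          --framework-version=tensorflow-2.6.3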
Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=tensorflow-1.15.5
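.. note::

   Within each TensorFlow version, the four OS tabs vary only the
   ``--ami`` and ``--os`` flags:

   .. code-block:: bash

      # Ubuntu AMI            --ami=non-dlami --os=ubuntu
      # Amazon Linux AMI      --ami=non-dlami --os=amazonlinux
      # Ubuntu DLAMI          --ami=dlami     --os=ubuntu
      # Amazon Linux DLAMI    --ami=dlami     --os=amazonlinux
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install tensorflow --mode=deploy --ami=dlami \
          --os=amazonlinux --neuron-version=1.18.0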
================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.19.0-tensorflow-install.rst
================================================

.. _install-neuron-1.19.0-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5
Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5
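.. note::

   To regenerate the compile-mode instructions for every pinned TensorFlow
   version on this page in one pass (a convenience sketch; the version
   strings are exactly the ones the tabs above pass to the helper):

   .. code-block:: bash

      for fw in tensorflow-2.7.1 tensorflow-2.6.3 tensorflow-2.5.3 tensorflow-1.15.5; do
          echo "== ${fw} =="
          python3 src/helperscripts/neuronsetuphelper.py \
              --file src/helperscripts/neuron-releases-manifest.json \
              --install tensorflow --mode=compile --ami=non-dlami \
              --os=ubuntu --neuron-version=1.19.0 \
              --framework-version="${fw}"
      done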
Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.8.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: TensorFlow 2.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.7.1

   .. tab-item:: TensorFlow 2.6.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.6.3

   .. tab-item:: TensorFlow 2.5.3

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-2.5.3

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux AMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Ubuntu DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5

         .. tab-item:: Amazon Linux DLAMI

            .. include :: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5
include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install tensorflow --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=tensorflow-1.15.5


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-al2023.rst
================================================
.. _tensorflow-neuron-install-prev-al2023:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install Previous TensorFlow Neuron Releases for Amazon Linux 2023 (``tensorflow-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1

This section will assist you in installing previous Neuron releases.

.. tab-set::

   .. tab-item:: Neuron 2.21.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.20.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.19.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-u20.rst
================================================
.. _tensorflow-neuron-install-prev-u20:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install Previous TensorFlow Neuron Releases for Ubuntu 20 (``tensorflow-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1

This section will assist you in installing previous Neuron releases.

.. tab-set::

   .. tab-item:: Neuron 2.21.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.20.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.19.0
      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-u22.rst
================================================
.. _tensorflow-neuron-install-prev-u22:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install Previous TensorFlow Neuron Releases for Ubuntu 22 (``tensorflow-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. toctree::
   :maxdepth: 1

This section will assist you in installing previous Neuron releases.

.. tab-set::

   .. tab-item:: Neuron 2.21.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.20.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami

   .. tab-item:: Neuron 2.19.0

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev.rst
================================================
.. _install-prev-neuron-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install previous TensorFlow Neuron releases
===========================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. toctree::
   :maxdepth: 1

   Neuron 1.19.0
   Neuron 1.18.0
   Neuron 1.17.2
   Neuron 1.17.1
   Neuron 1.17.0
   Neuron 1.16.3
   Neuron 1.15.2
   Neuron 1.15.1
   Neuron 1.15.0
   Neuron 1.14.2


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-install.rst
================================================
.. _install-neuron-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install TensorFlow Neuron
=========================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

..
contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: TensorFlow 2.10.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.7.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.10.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.7.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: TensorFlow 2.10.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 2.7.4

      .. tab-set::

         .. tab-item:: Ubuntu 20 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

         .. tab-item:: Amazon Linux 2 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu 20 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

         .. tab-item:: Amazon Linux 2 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-update-u20.rst
================================================
.. _tensorflow-neuron-u20-update:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Update to latest TensorFlow Neuron (``tensorflow-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

If you already have a previous Neuron release installed, this section provides links that will assist you in updating to the latest Neuron release.

.. tab-set::

   .. tab-item:: TensorFlow 2.10.1

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 2.9.3

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 2.8.4

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-update-u22.rst
================================================
.. _tensorflow-neuron-u22-update:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Update to latest TensorFlow Neuron (``tensorflow-neuron``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

If you already have a previous Neuron release installed, this section provides links that will assist you in updating to the latest Neuron release.

.. tab-set::

   .. tab-item:: TensorFlow 2.10.1

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 2.9.3

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 2.8.4

      .. include:: /setup/install-templates/inf1/note-setup-general.rst

      .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/setup/tensorflow-update.rst
================================================
.. _update-neuron-tensorflow:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Update to latest TensorFlow Neuron
==================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. include:: /setup/install-templates/inf1/note-setup-cntr.rst

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: TensorFlow 2.10.1

      .. tab-set::

         .. tab-item:: Ubuntu 20 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            ..
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.7.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: TensorFlow 2.10.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.7.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 1.15.5 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: TensorFlow 2.10.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.9.3 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.8.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: TensorFlow 2.7.4 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=2.7.4 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

   .. tab-item:: TensorFlow 1.15.5

      .. tab-set::

         .. tab-item:: Ubuntu 20 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami

         .. tab-item:: Amazon Linux 2 DLAMI Base

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=tensorflow --framework-version=1.15.5 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami


================================================
FILE: archive/tensorflow/tensorflow-neuron/tensorflow2-accelerated-ops.rst
================================================
.. _tensorflow-ref-neuron-accelerated-ops:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 2.x (``tensorflow-neuron``) Accelerated Python APIs and Graph Ops
=============================================================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

This page lists the TensorFlow 2.x Python APIs and graph operators that are accelerated by AWS Neuron. The lists are not exhaustive: TensorFlow 2.x Python APIs or graph operators that are not listed here may still be accelerated if they are composed of accelerated primitives; otherwise, they are executed on CPU without significant acceleration. The TensorFlow Neuron integration contains an automatic operator-device-placement mechanism that strives to maximize the execution efficiency of your deep learning models on AWS Machine Learning ASIC instances.
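The result of this placement can be inspected after compilation. The following is a minimal, illustrative sketch, assuming a model already compiled with ``tfn.trace`` and saved to ``./model_neuron_dir`` (a hypothetical path), and assuming fused subgraphs surface as ``NeuronOp`` nodes; it counts op types in the serving graph to show how much of the model runs on Neuron versus CPU:

.. code:: python

   import collections

   import tensorflow as tf

   # Load a model previously compiled with tfn.trace and saved to disk
   # (the path is illustrative).
   loaded = tf.saved_model.load('./model_neuron_dir')

   # Count op types in the serving graph. A well-placed model shows most of
   # its compute fused into one or a few Neuron subgraph ops (the exact op
   # name "NeuronOp" is an assumption here).
   graph_def = loaded.signatures['serving_default'].graph.as_graph_def()
   op_counts = collections.Counter(node.op for node in graph_def.node)
   for op_name, count in op_counts.most_common():
       print(op_name, count)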
Accelerated Python APIs
--------------------------------

.. list-table::
   :header-rows: 1

   * - Module
     - Accelerated Python API
     - Comments
   * - ``tf``
     - ``tf.abs``
     -
   * -
     - ``tf.add``
     -
   * -
     - ``tf.add_n``
     -
   * -
     - ``tf.broadcast_static_shape``
     -
   * -
     - ``tf.cast``
     -
   * -
     - ``tf.constant``
     -
   * -
     - ``tf.convert_to_tensor``
     -
   * -
     - ``tf.cumsum``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.einsum``
     -
   * -
     - ``tf.erf``
     -
   * -
     - ``tf.exp``
     -
   * -
     - ``tf.identity``
     -
   * -
     - ``tf.matmul``
     - Uses float16/bfloat16 matmul with float32 accumulation.
   * -
     - ``tf.maximum``
     -
   * -
     - ``tf.minimum``
     -
   * -
     - ``tf.multiply``
     -
   * -
     - ``tf.negative``
     -
   * -
     - ``tf.range``
     - ``start``, ``limit`` and ``delta`` arguments must be compile-time constants.
   * -
     - ``tf.realdiv``
     -
   * -
     - ``tf.reciprocal``
     -
   * -
     - ``tf.reduce_all``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reduce_any``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reduce_max``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reduce_min``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reduce_prod``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reduce_sum``
     - ``axis`` must be a compile-time constant.
   * -
     - ``tf.reshape``
     - ``shape`` argument must be a compile-time constant.
   * -
     - ``tf.rsqrt``
     -
   * -
     - ``tf.scalar_mul``
     -
   * -
     - ``tf.shape``
     -
   * -
     - ``tf.shape_n``
     -
   * -
     - ``tf.sigmoid``
     -
   * -
     - ``tf.size``
     -
   * -
     - ``tf.slice``
     - ``size`` must be a compile-time constant. In addition, either ``begin`` must be a compile-time constant or ``size`` must be non-negative.
   * -
     - ``tf.sqrt``
     -
   * -
     - ``tf.square``
     -
   * -
     - ``tf.squared_difference``
     -
   * -
     - ``tf.squeeze``
     -
   * -
     - ``tf.stack``
     -
   * -
     - ``tf.stop_gradient``
     -
   * -
     - ``tf.strided_slice``
     -
   * -
     - ``tf.tanh``
     -
   * -
     - ``tf.tensordot``
     -
   * -
     - ``tf.to_bfloat16``
     -
   * -
     - ``tf.to_float``
     -
   * -
     - ``tf.truediv``
     -
   * - ``tf.layers``
     - ``tf.layers.batch_normalization``
     -
   * -
     - ``tf.layers.dense``
     -
   * -
     - ``tf.layers.flatten``
     -
   * - ``tf.nn``
     - ``tf.nn.batch_normalization``
     -
   * -
     - ``tf.nn.bias_add``
     -
   * -
     - ``tf.nn.dropout``
     - Always treated as ``tf.identity`` during inference.
   * -
     - ``tf.nn.fused_batch_norm``
     -
   * -
     - ``tf.nn.leaky_relu``
     -
   * -
     - ``tf.nn.relu``
     -
   * -
     - ``tf.nn.relu6``
     -
   * -
     - ``tf.nn.relu_layer``
     -
   * -
     - ``tf.nn.softmax``
     -

Accelerated graph operators
--------------------------------

.. code:: python

   Add AddN AddV2 BatchMatMul BatchMatMulV2 BiasAdd
   Cast Const Cumsum Einsum Erf Exp ExpandDims
   FusedBatchNorm FusedBatchNormV2 FusedBatchNormV3
   Greater Identity LeakyRelu MatMul Max Maximum
   Minimum Mean Mul Neg Pack RealDiv Relu Relu6
   Reshape Rsqrt Sigmoid Softmax Split SplitV Sqrt
   Square SquaredDifference Squeeze StridedSlice
   Sub Sum Tanh Transpose Unpack

The lists share many commonalities with `Available TensorFlow Ops `_. Portions of this page are modifications based on work created and `shared by Google <https://developers.google.com/readme/policies>`_ and used according to terms described in the `Creative Commons 4.0 Attribution License <https://creativecommons.org/licenses/by/4.0/>`_.


================================================
FILE: archive/tensorflow/tensorflow-neuron/tf2_faq.rst
================================================
.. _tf2_faq:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow 2.x FAQ
===================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 1

How do I get started with TensorFlow?
-------------------------------------

The easiest entry point is the tutorials offered by the AWS Neuron team. For beginners, the :ref:`HuggingFace DistilBERT Tutorial ` is a good place to start.

What TensorFlow versions are supported by Neuron?
-------------------------------------------------

AWS Neuron provides well-tested ``tensorflow-neuron`` packages that work with a range of official TensorFlow releases, as long as the version of ``tensorflow-neuron`` matches that of ``tensorflow``. For example, you may install ``tensorflow-neuron==2.3.3.1.0.9999.0`` on top of ``tensorflow==2.3.3`` and expect them to work together. Currently, ``tensorflow-neuron`` works with TensorFlow versions 2.1.4, 2.2.3, 2.3.3, 2.4.2 and 2.5.0.

In a fresh Python environment, ``pip install tensorflow-neuron`` brings in the highest version (2.5.0 as of 07/13/2021), which then pulls ``tensorflow==2.5.0`` into the current environment. If you already have a particular version of TensorFlow 2.x installed, pay attention to the precise version of ``tensorflow-neuron`` and install only the matching one. For example, in an existing Python environment with ``tensorflow==2.3.3`` installed, you may install it with ``pip install tensorflow-neuron==2.3.3.*``, which will reuse the existing TensorFlow installation.
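A quick sanity check of an existing environment might look like the following sketch (the version layout is as described above; ``pkg_resources`` ships with setuptools):

.. code:: python

   import pkg_resources

   # tensorflow-neuron package versions are prefixed with the tensorflow
   # release they were built against, e.g. 2.3.3.1.0.9999.0 pairs with
   # tensorflow 2.3.3.
   tf_version = pkg_resources.get_distribution('tensorflow').version
   tfn_version = pkg_resources.get_distribution('tensorflow-neuron').version

   print('tensorflow        :', tf_version)
   print('tensorflow-neuron :', tfn_version)
   assert tfn_version.startswith(tf_version), (
       'tensorflow-neuron should be built for the installed tensorflow version')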
What operators are supported?
-----------------------------

Due to fundamental backend design changes in the TensorFlow 2.x framework, the concept of "supported graph operators" is no longer well-defined. Please refer to :ref:`Accelerated Python APIs and graph operators ` for a guide to the set of TensorFlow 2.x Python APIs and graph operators that can be accelerated by Neuron.

How do I compile my model?
--------------------------

Compilation is done through a public API called ``tfn.trace``, which resembles the compilation API of the AWS Neuron PyTorch integration. Programmatically, customers can execute the following code:

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   ...
   model = tf.keras.Model(inputs=inputs, outputs=outputs)
   model_neuron = tfn.trace(model, example_inputs)
   model_neuron.save('./model_neuron_dir')
   ...
   model_loaded = tf.saved_model.load('./model_dir')
   predict_func = model_loaded.signatures['serving_default']
   model_loaded_neuron = tfn.trace(predict_func, example_inputs2)
   model_loaded_neuron.save('./model_loaded_neuron_dir')
   ...

How do I deploy my model?
-------------------------

Python tensorflow
^^^^^^^^^^^^^^^^^

Pre-compiled models can be saved and reloaded back into a Python environment using regular TensorFlow model loading APIs, as long as ``tensorflow-neuron`` is installed.

.. code:: python

   import tensorflow as tf

   model = tf.keras.models.load_model('./model_loaded_neuron_dir')
   example_inputs = ...
   output = model(example_inputs)

tensorflow-serving
^^^^^^^^^^^^^^^^^^

Pre-compiled models can be saved into SavedModel format via the TensorFlow SavedModel APIs:

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn

   ...
   model = tf.keras.Model(inputs=inputs, outputs=outputs)
   model_neuron = tfn.trace(model, example_inputs)
   tf.saved_model.save(model_neuron, './model_neuron_dir/1')

The generated SavedModel ``./model_neuron_dir`` can be loaded into tensorflow-model-server-neuron, which can be installed through apt or yum depending on the operating system. For example, on Ubuntu 18.04 LTS the following commands install and launch a tensorflow-model-server-neuron on a pre-compiled SavedModel (a minimal REST client sketch appears at the end of this FAQ):

.. code:: bash

   sudo apt install tensorflow-model-server-neuron

   # --model_base_path needs to be an absolute path
   tensorflow_model_server_neuron --model_base_path=$(pwd)/model_neuron_dir

Where can I find tutorials and examples?
-----------------------------------------

:ref:`HuggingFace DistilBERT Tutorial ` is a good place to start.

How to debug or profile my model?
---------------------------------

:ref:`AWS Neuron TensorBoard integration ` provides visibility into what is happening inside of the Neuron runtime, and allows more fine-grained (but also more hardware-aware) reasoning on where to improve the performance of machine learning applications.
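Returning to the deployment question above: for a quick end-to-end check of a model served by ``tensorflow_model_server_neuron``, a REST client along the following lines may help. This is illustrative only: it assumes the server was additionally started with ``--rest_api_port=8501``, uses TensorFlow Serving's default model name ``default``, and requires the third-party ``requests`` package.

.. code:: python

   import requests

   # Illustrative request payload; replace with inputs shaped like the
   # example_inputs used at compilation time.
   payload = {'instances': [[0.0] * 128]}

   # Port and model name are assumptions: TensorFlow Serving's REST API
   # defaults the model name to "default", and --rest_api_port=8501 must
   # have been passed when launching the server.
   response = requests.post(
       'http://localhost:8501/v1/models/default:predict', json=payload)
   response.raise_for_status()
   print(response.json()['predictions'])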
================================================
FILE: archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo.rst
================================================
.. _tensorflow-bert-demo:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

[Broken] Running TensorFlow BERT-Large with AWS Neuron
=======================================================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

This example shows a Neuron compatible BERT-Large implementation that is functionally equivalent to the open source BERT-Large model. The demo uses TensorFlow-Neuron and BERT-Large weights fine-tuned for MRPC, and also shows the performance achieved by the Inf1 instance. Users who want to use public BERT SavedModels should also follow the steps described in :ref:`using-public-bert-savedmodels`.

Launch EC2 instances
--------------------

For this demo, launch two EC2 instances:

- a c5.4xlarge instance for compiling the BERT-Large model, and
- an inf1.xlarge instance for running inference.

For both of these instances choose the latest Ubuntu 18 Deep Learning AMI (DLAMI).

.. _compiling-neuron-compatible-bert-large:

Compiling Neuron compatible BERT-Large
--------------------------------------

First connect to the c5.4xlarge instance and update tensorflow-neuron and neuron-cc.

Update compilation EC2 instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update to the latest Neuron software by executing the instructions at :ref:`install-neuron-tensorflow`.

Note: if the tensorflow-neuron version on your inference instance is lower than 1.15.0.1.0.1333.0, you will need to run this demo on inf1.2xlarge instead of inf1.xlarge.

Compile open source BERT-Large saved model using Neuron compatible BERT-Large implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neuron software works with TensorFlow saved models. Users should bring their own BERT-Large saved model for this section. This demo runs inference for the MRPC task, so the saved model should be fine-tuned for MRPC. Users who need additional help to fine-tune the model for MRPC or to create a saved model can refer to :ref:`bert-tensorflow-demo-appendix1`.

In the same environment and directory as the bert_demo scripts, run the following:

.. code:: bash

   git clone https://github.com/aws/aws-neuron-sdk
   cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
   export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
   pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/
   pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com
   python bert_model.py --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=6 --aggressive_optimizations

This compiles the BERT-Large model pointed to by ``$BERT_LARGE_SAVED_MODEL`` for an input size of 128 and batch size of 6, and stores the compilation output in ``bert-saved-model-neuron``. Copy this directory to your Inf1 instance for inference. The bert_model.py script encapsulates all the steps necessary for this process; for details on what bert_model.py does, refer to :ref:`bert-tensorflow-demo-appendix2`.

Running the inference demo
--------------------------

Connect to your inf1.xlarge instance and update tensorflow-neuron, aws-neuron-runtime and aws-neuron-tools.

Update inference EC2 instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update to the latest Neuron software by executing the instructions at :ref:`install-neuron-tensorflow`.

Launching the BERT-Large demo server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copy the compiled model (``bert-saved-model-neuron``) from your c5.4xlarge instance to your inf1.xlarge instance. Place the model in the same directory as the bert_demo scripts.
Sending requests to server from multiple clients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Wait until the bert demo server is ready to accept requests. Then, on the same inf1.xlarge instance, launch a separate Linux terminal. From the bert_demo directory, execute the following commands:

.. code:: bash

    source activate aws_neuron_tensorflow_p36
    cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
    for i in {1..96}; do python bert_client.py --cycle 128 & done

This spins up 96 clients, each of which sends 128 inference requests.

Printing latency metrics
~~~~~~~~~~~~~~~~~~~~~~~~

After all your requests have been sent to your server, you can run the following command:

.. code:: bash

    python latency_printer.py

.. _using-public-bert-savedmodels:

Using public BERT SavedModels
-----------------------------

We are now providing a compilation script that has better compatibility with various flavors of BERT SavedModels generated from https://github.com/google-research/bert. Here are the current limitations:

1. You did not change `modeling.py `__
2. The BERT SavedModel is generated using ``estimator.export_saved_model``
3. The BERT SavedModel uses a fixed sequence length of 128 (you may check by running ``saved_model_cli show --dir /path/to/user/bert/savedmodel --all``)
4. ``neuron-cc`` version is at least 1.0.12000.0
5. ``aws-neuron-runtime`` version is at least 1.0.7000.0
6. The ``--batch_size`` argument specified in this script is at most 4

Example usage is shown below:

.. code:: bash

    export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
    cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
    python bert_no_model.py --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=1
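To verify the fixed sequence length of 128 (limitation 3 above) from Python instead of ``saved_model_cli``, a sketch like the following can work. Note that ``saved_model_utils`` lives under TensorFlow's internal ``tensorflow.python.tools`` path, and the SavedModel directory below is a placeholder:

.. code:: python

    from tensorflow.python.tools import saved_model_utils

    saved_model_dir = '/path/to/user/bert/savedmodel'  # placeholder path
    meta_graph = saved_model_utils.get_meta_graph_def(saved_model_dir, 'serve')
    signature = meta_graph.signature_def['serving_default']
    for name, tensor_info in signature.inputs.items():
        dims = [d.size for d in tensor_info.tensor_shape.dim]
        print(name, dims)  # sequence inputs should report a fixed length of 128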
.. _bert-tensorflow-demo-appendix1:

Appendix 1
----------

Users who need help fine-tuning BERT-Large for MRPC and creating a saved model may follow the instructions here.

Connect to the c5.4xlarge compilation EC2 instance you started above and download these three items:

1. Clone `this `__ github repo.
2. Download GLUE data as described `here `__. Do not run the finetuning command.
3. Download a desired pre-trained BERT-Large checkpoint from `here `__. This is the model we will fine-tune.

Next, edit run_classifier.py in the cloned bert repo to apply the patch described in the following git diff:

::

    diff --git a/run_classifier.py b/run_classifier.py
    index 817b147..c9426bc 100644
    --- a/run_classifier.py
    +++ b/run_classifier.py
    @@ -955,6 +955,18 @@ def main(_):
             drop_remainder=predict_drop_remainder)

         result = estimator.predict(input_fn=predict_input_fn)
    +    features = {
    +      "input_ids": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='input_ids'),
    +      "input_mask": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='input_mask'),
    +      "segment_ids": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='segment_ids'),
    +      "label_ids": tf.placeholder(shape=[None], dtype=tf.int32, name='label_ids'),
    +      "is_real_example": tf.placeholder(shape=[None], dtype=tf.int32, name='is_real_example'),
    +    }
    +    serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(features)
    +    estimator._export_to_tpu = False  ## !!important to add this
    +    estimator.export_saved_model(
    +        export_dir_base='./bert_classifier_saved_model',
    +        serving_input_receiver_fn=serving_input_fn)

         output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
         with tf.gfile.GFile(output_predict_file, "w") as writer:

NOTE: Users who are interested may refer to this `link `__ for additional background information on the patch, but it is not necessary for running this demo.

Then, from the bert_demo directory, run the following:

.. code:: bash

    source activate aws_neuron_tensorflow_p36
    cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
    export BERT_REPO_DIR="/path/to/cloned/bert/repo/directory"
    export GLUE_DIR="/path/to/glue/data/directory"
    export BERT_BASE_DIR="/path/to/pre-trained/bert-large/checkpoint/directory"
    ./tune_save.sh

A saved model will be created in $BERT_REPO_DIR/bert-saved-model/*random_number*/, where *random_number* is a random number generated for every run. Use this saved model to continue with the rest of the demo.

.. _bert-tensorflow-demo-appendix2:

Appendix 2
----------

For all BERT variants, we currently need to augment the standard Neuron compilation process for performance tuning. In the future, we intend to automate this tuning process. This would allow users to use the standard Neuron compilation process, which requires only a one-line change in user source code (see the sketch at the end of this appendix). The standard compilation process is described in :ref:`/src/examples/mxnet/resnet50/resnet50.ipynb`.

The augmented Neuron compilation process is encapsulated by the bert_model.py script, which performs the following steps:

1. Defines a Neuron compatible implementation of BERT-Large. For inference, this is functionally equivalent to the open source BERT-Large. The changes needed to create a Neuron compatible BERT-Large implementation are described in :ref:`bert-tensorflow-demo-appendix3`.
2. Extracts the BERT-Large weights from the open source saved model pointed to by --input_saved_model and associates them with the Neuron compatible model.
3. Invokes TensorFlow-Neuron to compile the Neuron compatible model for Inferentia using the newly associated weights.
4. Finally, saves the compiled model into the location given by --output_saved_model.
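For reference, the standard compilation process referred to above is a single call in the TensorFlow-Neuron 1.x API. A minimal sketch, with placeholder input and output paths:

.. code:: python

    import tensorflow.neuron as tfn

    # Compile an existing SavedModel for Inferentia in one line
    # (TensorFlow-Neuron 1.x API); both paths are placeholders.
    tfn.saved_model.compile('./bert_saved_model', './bert_saved_model_neuron')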
.. _bert-tensorflow-demo-appendix3:

Appendix 3
----------

The Neuron compatible implementation of BERT-Large is functionally equivalent to the open source version when used for inference. However, the detailed implementation does differ, and here is the list of changes:

1. Data Type Casting: If the original BERT-Large is an FP32 model, bert_model.py contains manually defined cast operators to enable mixed-precision. FP16 is used for multi-head attention and fully-connected layers, and FP32 everywhere else. This will be automated in a future release.
2. Remove Unused Operators: A model typically contains training operators that are not used in inference, including a subset of the reshape operators. Those operators do not affect inference functionality and have been removed.
3. Reimplementation of Selected Operators: A number of operators (mainly mask operators) have been reimplemented to bypass a known compiler issue. This will be fixed in a planned future release.
4. Manually Partition Embedding Ops to CPU: The embedding portion of BERT-Large has been partitioned manually to a subgraph that is executed on the host CPU, without noticeable performance impact. In the near future, we plan to implement this through compiler auto-partitioning, without the need for user intervention.



================================================
FILE: archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/glue_mrpc_dev.tsv
================================================

Quality #1 ID #2 ID #1 String #2 String
1 1355540 1355592 He said the foodservice pie business doesn 't fit the company 's long-term growth strategy . " The foodservice pie business does not fit our long-term growth strategy .
0 2029631 2029565 Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war . His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .
0 487993 487952 The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat . The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .
1 1989515 1989458 The AFL-CIO is waiting until October to decide if it will endorse a candidate . The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
0 1783137 1782659 No dates have been set for the civil or the criminal trial . No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty .
1 3039165 3039036 Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed . It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
0 1490811 1490840 While dioxin levels in the environment were up last year , they have dropped by 75 percent since the 1970s , said Caswell . The Institute said dioxin levels in the environment have fallen by as much as 76 percent since the 1970s .
1 426112 426210 This integrates with Rational PurifyPlus and allows developers to work in supported versions of Java , Visual C # and Visual Basic .NET. IBM said the Rational products were also integrated with Rational PurifyPlus , which allows developers to work in Java , Visual C # and VisualBasic .Net.
1 1439663 1439808 The top rate will go to 4.45 percent for all residents with taxable incomes above $ 500,000 . For residents with incomes above $ 500,000 , the income-tax rate will increase to 4.45 percent .
1 3147370 3147525 The results appear in the January issue of Cancer , an American Cancer Society journal , being published online today . The results appear in the January issue of Cancer , an American Cancer Society ( news - web sites ) journal , being published online Monday .
1 3300040 3299992 The delegates said raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers . Bin Laden ’ s men pointed out that raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers . 0 524136 524119 " Sanitation is poor ... there could be typhoid and cholera , " he said . " Sanitation is poor , drinking water is generally left behind . . . there could be typhoid and cholera . " 0 969512 969295 The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 . The technology-laced Nasdaq Composite Index was down 25.36 points , or 1.53 percent , at 1,628.26 . 1 1685339 1685429 The only announced Republican to replace Davis is Rep. Darrell Issa of Vista , who has spent $ 1.71 million of his own money to force a recall . So far the only declared major party candidate is Rep. Darrell Issa , a Republican who has spent $ 1.5 million of his own money to fund the recall . 1 1967578 1967664 The decision to issue new guidance has been prompted by intelligence passed to Britain by the FBI in a secret briefing in late July . Scotland Yard 's decision to issue new guidance has been prompted by new intelligence passed to Britain by the FBI in late July . 1 2047034 2046820 Unable to find a home for him , a judge told mental health authorities they needed to find supervised housing and treatment for DeVries somewhere in California . The judge had told the state Department of Mental Health to find supervised housing and treatment for DeVries somewhere in California . 1 2046630 2046644 The decision came a year after Whipple ended federal oversight of the district 's racial balance , facilities , budget , and busing . The decision came a year after Whipple ended federal oversight of school busing as well as the district 's racial balance , facilities and budget . 0 2221603 2221633 In midafternoon trading , the Nasdaq composite index was up 8.34 , or 0.5 percent , to 1,790.47 . The Nasdaq Composite Index .IXIC dipped 8.59 points , or 0.48 percent , to 1,773.54 . 1 129995 129864 Morgan Stanley raised its rating on the beverage maker to " overweight " from " equal-weight " saying in part that pricing power with its bottlers should improve in 2004 . Morgan Stanley raised its rating on the company to " overweight " from " equal-weight , " saying the beverage maker 's pricing power with bottlers should improve in 2004 . 0 919683 919782 The pound also made progress against the dollar , reached fresh three-year highs at $ 1.6789 . The British pound flexed its muscle against the dollar , last up 1 percent at $ 1.6672 . 0 970740 971209 Friday , Stanford ( 47-15 ) blanked the Gamecocks 8-0 . Stanford ( 46-15 ) has a team full of such players this season . 1 2745055 2745022 Last month Intel raised its revenue guidance for the quarter to between $ 7.6 billion and $ 7.8 billion . At the end of the second quarter , Intel initially predicted sales of between $ 6.9 billion and $ 7.5 billion . 0 2199097 2199072 The driver , Eugene Rogers , helped to remove children from the bus , Wood said . At the accident scene , the driver was " covered in blood " but helped to remove children , Wood said . 
1 1609290 1609098 ONG KONG , July 9 Tens of thousands of demonstrators gathered tonight before the legislature building here to call for free elections and the resignation of Hong Kong 's leader . Tens of thousands of demonstrators gathered yesterday evening to stand before this city 's legislature building and call for free elections and the resignation of Hong Kong 's leader . 1 1597193 1597119 Saddam loyalists have been blamed for sabotaging the nation 's infrastructure , as well as frequent attacks on U.S. soldiers . Hussein loyalists have been blamed for sabotaging the nation 's infrastructure and attacking US soldiers . 1 2758944 2758975 Its closest living relatives are a family frogs called sooglossidae that are found only in the Seychelles in the Indian Ocean . Its closest relative is found in the Seychelles Archipelago , near Madagascar in the Indian Ocean . 0 2584416 2584653 Cooley said he expects Muhammad will similarly be called as a witness at a pretrial hearing for Malvo . Lee Boyd Malvo will be called as a witness Wednesday in a pretrial hearing for fellow sniper suspect John Allen Muhammad . 1 86007 86373 " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . " " Instead of pursuing the most imminent and real threats - international terrorists - this Bush administration has chosen to settle old scores , " Graham said . 1 1602860 1602844 He said they lied on a sworn affidavit that requires them to list prior marriages . Morgenthau said the women , all U.S. citizens , lied on a sworn affidavit that requires them to list prior marriages . 1 1201306 1201329 The association said 28.2 million DVDs were rented in the week that ended June 15 , compared with 27.3 million VHS cassettes . The Video Software Dealers Association said 28.2 million DVDs were rented out last week , compared to 27.3 million VHS cassettes . 0 461779 461815 With these assets , Funny Cide has a solid chance to become the first Triple Crown winner since Affirmed in 1978 . Funny Cide is looking to become horse racing 's first Triple Crown winner in a generation . 1 1438666 1438643 Intel was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel , " spokesman Chuck Mulloy said . Intel spokesman Chuck Mulloy said the company was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel . " 1 3261484 3261306 Mr Annan also warned the US should not use the war on terror as an excuse to suppress " long-cherished freedoms " . Annan warned that the dangers of extremism after September 11 should not be used as an excuse to suppress " long-cherished " freedoms . 1 1277539 1277527 At community colleges , tuition will jump to $ 2,800 from $ 2,500 . Community college students will see their tuition rise by $ 300 to $ 2,800 or 12 percent . 1 3035788 3035918 He made a point of saying during Tuesdays debate that the Confederate flag was a racist symbol . Though Dean made a point of saying during the debate that the Confederate flag is a racist symbol . 0 132553 132725 Bush wanted " to see an aircraft landing the same way that the pilots saw an aircraft landing , " White House press secretary Ari Fleischer said yesterday . On Tuesday , before Byrd 's speech , Fleischer said Bush wanted ' ' to see an aircraft landing the same way that the pilots saw an aircraft landing . 
0 2259788 2259747 On Monday the Palestinian Prime Minister , Mahmoud Abbas , will report to the Palestinian parliament on his Government 's achievements in its first 100 days in office . Palestinian Prime Minister Mahmoud Abbas must defend the record of his first 100 days in office before Parliament today as the death toll in the occupied territories continues to rise . 0 2307064 2307235 The civilian unemployment rate improved marginally last month -- slipping to 6.1 percent -- even as companies slashed payrolls by 93,000 . The civilian unemployment rate improved marginally last month _ sliding down to 6.1 percent _ as companies slashed payrolls by 93,000 amid continuing mixed signals about the nation 's economic health . 1 3046488 3046824 Per-user pricing is $ 29 for Workplace Messaging , $ 89 for Team Collaboration and $ 35 for Collaborative Learning . Workplace Messaging is $ 29 , Workplace Team Collaboration is $ 89 , and Collaborative Learning is $ 35 . 1 86020 86007 " Instead of pursuing the most imminent and real threats – international terrorism – this Bush administration chose to settle old scores , " Mr. Graham said . " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . " 0 1100998 1100441 SARS has killed about 800 people and affected more than 8400 since being detected in China in November . SARS has killed about 800 people and sickened more than 8,400 worldwide , mostly in Asia . 1 2268396 2268480 Authorities had no evidence to suggest the two incidents were connected . There was no immediate evidence that the two incidents were connected , police said . 0 1984039 1983986 " Jeremy 's a good guy , " Barber said , adding : " Jeremy is living the dream life of the New York athlete . He also said Shockey is " living the dream life of a New York athlete . 0 2697659 2697747 Ratliff 's daughters , Margaret and Martha Ratliff , were adopted by Peterson after their mother 's death . Peterson helped raise Ratliff 's two daughters , Margaret and Martha Ratliff , who supported him throughout the trial . 0 2175939 2176090 After losing as much as 84.56 earlier , the Dow Jones industrial average closed up 22.81 , or 0.2 percent , at 9,340.45 . In midday trading , the Dow Jones industrial average lost 68.84 , or 0.7 percent , to 9,248.80 . 1 886618 886456 Rumsfeld , who has been feuding for two years with Army leadership , passed over nine active-duty four-star generals . Rumsfeld has been feuding for a long time with Army leadership , and he passed over nine active-duty four-star generals . 1 588637 588864 Consumers who said jobs are difficult to find jumped from 29.4 to 32.6 , while those claiming work was plentiful slipped from 13 to 12.6 . Consumers who said jobs are difficult to find jumped to 32.6 from 29.4 , while those saying work was plentiful slipped to 12.6 from 13 in April . 0 2252795 2252970 He has no immediate plans for television advertising , believing it is unnecessary this early . A Lieberman aide said there were no immediate plans for television advertising . 1 1756329 1756394 " I think it happened very quickly , " Houston Police Department homicide investigator Phil Yochum said of the crime . " I think it happened very quickly , " said Investigator Phil Yochum of the Houston Police Department 's homicide division . 1 1673112 1673068 United issued a statement saying it will " work professionally and cooperatively with all its unions . 
" Senior vice president Sara Fields said the airline " will work professionally and cooperatively with all our unions . " 1 2357324 2357271 " But they never climb out of the pot of beer again . " It 's just that they never climb out of the beer again . " 1 780408 780363 Chief financial officer Andy Bryant has said that hike had a greater affect volume than officials expected . Bryant has said that hike had a greater effect on demand than officials expected . 1 821523 821385 Robert Liscouski , the Assistant Secretary of Homeland Security for Infrastructure Protection , will oversee NCSD . NCSD 's chief will be Robert Liscouski , the assistant secretary of Homeland Security for Infrastructure Protection . 1 2304696 2304863 HP 's shipments increased 48 percent year-over-year , compared to an increase of 31 percent for Dell . HPs shipments increased 48 per cent year-on-year , compared to an increase of 31 per cent for Dell . 1 2531749 2531607 Chirac , who can pardon a law-breaker , refused Humbert 's request last year but kept in close touch with the family . Chirac , who has the authority to pardon law-breakers , refused Humbert 's request to be allowed to die last year but kept in close touch with the family . 1 3180014 3179967 The charges allege that he was part of the conspiracy to kill and kidnap persons in a foreign country . The government now charges that Sattar conspired with Rahman to kill and kidnap individuals in foreign countries . 1 726966 726945 In the 2002 study , the margin of error ranged from 1.8 to 4.4 percentage points . It has a margin of error of plus or minus three to four percentage points . 1 2638861 2638982 Mr. Clinton 's national security adviser , Sandy Berger , said that the White House wasn 't informed of the FBI activities . Clinton ’ s national security adviser , Sandy Berger , said in an interview that the White House was not informed of the FBI activities . 1 2495223 2495307 " This decision is clearly incorrect , " FTC Chairman Timothy Muris said in a written statement . The decision is " clearly incorrect , " FTC Chairman Tim Muris said . 1 55187 54831 Prosecutors allege that Nichols and co-conspirator Timothy McVeigh worked together to prepare a bomb that destroyed the Alfred P. Murrah Federal Building . Prosecutors allege that Nichols and coconspirator Timothy McVeigh worked together to prepare a 4,000-pound fuel-and-fertilizer bomb that destroyed the Murrah building . 0 2763381 2763517 Terri Schiavo , 39 , is expected to die sometime in the next two weeks in the Tampa-area hospice where she has spent the past several years . Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler . 1 1990975 1991132 Secretary of State Colin Powell designated the Chechen leader believed responsible for last year 's hostage standoff in a Moscow theater as a threat to U.S. security Friday . U.S. Secretary of State Colin Powell on Friday designated Chechen rebel leader Shamil Basayev a threat to the security of the United States and to U.S. citizens . 1 2204353 2204418 " Today , we are trying to convey this problem to Russian President Vladimir Putin and US President George W Bush . " " Today , we are trying to convey this problem to Russian President Vladimir Putin ( news - web sites ) and President Bush ( news - web sites ) . 
" 1 60122 60445 That would be a potential setback to Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries . The inquiry may hinder Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries . 1 961836 962243 PeopleSoft also said its board had officially rejected Oracle 's offer . Thursday morning , PeopleSoft 's board rejected the Oracle takeover offer . 0 3140260 3140288 The Dow Jones industrial average ended the day down 10.89 at 9,837.94 , after advancing 111.04 Wednesday . The Dow Jones industrial average fell 10.89 points , or 0.11 percent , to 9,837.94 . 1 1720166 1720115 Cortisol levels in the saliva of day care children were highest and rose most steeply in those judged by day care center personnel to be the shyest . Cortisol levels in the saliva of day-care children were highest and rose most steeply in those whom day-care centre staffed judged to be the shyest . 1 2573262 2573319 " The idea that Tony Abbott is in some way a one-dimensional political head-kicker couldn 't be more wrong , " Mr Howard said . " The idea that Tony Abbott is in some way a one-dimensional political head kicker couldn 't be more wrong . " 0 1353356 1353174 " Biotech products , if anything , may be safer than conventional products because of all the testing , " Fraley said , adding that 18 countries have adopted biotechnology . " Biotech products , if anything , may be safer than conventional products because of all the testing , " said Robert Fraley , Monsanto 's executive vice president . 1 2738677 2738741 The rate of skin cancer has tripled since the 1950s in Norway and Sweden , according to the study . The study also found that skin cancer nearly tripled in Norway and Sweden since the 1950s . 1 1638813 1639087 We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . 1 1605350 1605425 Trans fat makes up only 1 percent to 3 percent of the total fat Americans consume , compared with 14 percent for saturated fat . Trans fat accounts for 2.5 percent of Americans ' daily calories , compared to 11 percent to 12 percent for saturated fat . 1 2494149 2494073 However , a recent slide in prices and OPEC 's expectations of a surge in oil inventories have compounded its fears about a further softening of the market . A 14 percent slide in crude prices this month and expectations of a build up in oil inventories compounded OPEC 's fears of a further softening of the market . 1 3023029 3023229 Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son . Peterson , 31 , is charged with two counts of first-degree murder in the slayings of his wife , Laci , and their unborn son , Conner . 1 1351550 1351155 Carlson on Tuesday said he would not recuse himself from the case . Service officials said Carlson refused to recuse himself from the case . 1 981185 981234 The program will grow to include ports in Dubai , Turkey and Malaysia , among others . The program will be expanded to include areas of the Middle East such as Dubai , Turkey and Malaysia , Mr. Ridge said . 0 2111629 2111786 McCabe said he was considered a witness , not a suspect . " He is not considered a suspect , " McCabe said . 
1 655498 655391 The woman was exposed to the SARS virus while in the hospital but was not a health care worker , said Dr. Colin D ’ Cunha , Ontario ’ s commissioner of public health . The woman was exposed to the SARS virus while in the hospital but was not a health-care worker , said Dr Colin D 'Cunha , Ontario 's commissioner of public health . 1 533823 533909 He added that those " are not solely American principles , nor are they exclusively Western . " " These are not solely American principles nor are they exclusively Western , " Rumsfeld said . 1 581592 581570 " If we don 't march into Tehran , I think we will be in pretty good shape , " he said . " As long as we don 't march on Tehran , I think we are going to be in pretty good shape , " he said . 0 1010655 1010430 On Saturday , a 149mph serve against Agassi equalled Rusedski 's world record . On Saturday , Roddick equalled the world record with a 149 m.p.h. serve in beating Andre Agassi . 1 2241925 2242066 Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new technologies and methods to communicate more quickly and efficiently . Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new ways to communicate . 1 2796978 2797024 " APEC leaders are painfully aware that security and prosperity are inseparable , " Thai Prime Minister Thaksin Shinawatra told business leaders . " APEC leaders are painfully aware that security and prosperity are inseparable , " Thaksin said . 0 101746 101775 Danbury prosecutor Warren Murray could not be reached for comment Monday . Prosecutors could not be reached for comment after the legal papers were obtained late Monday afternoon . 1 327839 327748 Wittig resigned last year after being indicted on federal bank fraud charges involving a real estate loan unrelated to Westar business . Wittig resigned in late November about two weeks after being indicted on bank fraud charges in a real estate case unrelated to the company . 0 2988297 2988555 Shattered Glass , " starring Hayden Christensen as Stephen Glass , debuted well with $ 80,000 in eight theaters . " Shattered Glass " _ starring Hayden Christensen as Stephen Glass , The New Republic journalist fired for fabricating stories _ debuted well with $ 80,000 in eight theaters . 1 2217613 2217659 He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston . He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife . 0 2128530 2128455 However , EPA officials would not confirm the 20 percent figure . Only in the past few weeks have officials settled on the 20 percent figure . 1 2208376 2208198 University of Michigan President Mary Sue Coleman said in a statement on the university 's Web site , " Our fundamental values haven 't changed . " Our fundamental values haven 't changed , " Mary Sue Coleman , president of the university , said in a statement in Ann Arbor . 1 1980654 1980641 The first products are likely to be dongles costing between US $ 100 and US $ 150 that will establish connections between consumer electronics devices and PCs . The first products will likely be dongles costing $ 100 to $ 150 that will establish connections between consumer electronics devices and PCs . 0 589579 589557 However , Lapidus expects foreign brands ' sales to be up 4 percent , driven by strong truck sales at Honda Motor Co . 
Lapidus expects Ford to be down 5 percent , Chrysler down 10 percent and foreign brands up 4 percent driven by strong truck sales at Honda . 1 1636060 1635946 Michel , who remains in the government , denied that US pressure had provoked the government 's move . Michel , who has stayed in the new government , denied that it was U.S. pressure which had provoked the government 's move . 1 1630585 1630657 Some of the computers also are used to send spam e-mail messages to drum up traffic to the sites . Some are also used to send spam e-mail messages to boost traffic to the sites . 0 447728 447699 Indonesia 's army has often been accused of human rights abuses during GAM 's battle for independence , charges it has generally denied while accusing the separatists of committing rights violations . Indonesia 's army has been accused of human rights abuses during its earlier battles with GAM , charges it has generally denied . 1 1606495 1606619 Bush also hoped to polish his anti-AIDS credentials in Uganda , which has been hailed as an African pioneer in fighting the killer disease . President Bush flies to Uganda Friday hoping to polish his anti- AIDS credentials in a country hailed as an African pioneer in fighting the epidemic . 1 1550897 1550977 Later this year , the command will send trainers with soldiers from four North African nations on patrolling and intelligence gathering missions . This fall the command will send trainers to work with soldiers from four North African nations on patrolling and gathering intelligence . 0 490376 490490 The reports helped overcome investor jitters after the euro briefly hit an all-time high against the dollar Tuesday . Stocks slipped at the open after the euro hit record highs against the dollar . 1 3084554 3084612 Sales for the quarter beat expectations , rising 37 percent year-on-year to 1.76 billion euros . Sales rose 37 per cent year-on-year to 1.76bn , beating expectations . 1 315647 315778 If the MTA 's appeal to a higher court is successful , the $ 2 bus and subway base fare won 't be rolled back . If the MTA 's appeal is successful , the $ 2 bus and subway base fare won 't change . 1 3428298 3428362 Robert Walsh , 40 , remained in critical but stable condition Friday at Staten Island University Hospital 's north campus . Walsh , also 40 , was in critical but stable condition at Staten Island University Hospital last night . 1 2523564 2523358 The Guru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS ( Basic Input Output System ) update and a troubleshooting-assistance feature called Black Box . The µGuru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS update and a troubleshooting-assistance feature called Black Box . 1 2079200 2079131 U.S. corporate bond yield spreads tightened in spotty trading on Friday as Wall Street labored to get back on its feet after the largest power outage ever in North America . U.S. stocks rose slightly on feather-light volume on Friday , as Wall Street regrouped after the biggest-ever power outage in North America . 1 818091 817811 The company said it would issue revised guidance for the full fiscal year next month when it releases its Q2 results . The company said it would renew its guidance for 2003 when it announces its second quarter results in mid-July . 1 1580638 1580663 " I stand 100 percent by it , and I think our intelligence services gave us the correct information at the time . 
" I stand 100 percent by it , and I think that our intelligence services gave us the correct intelligence and information at the time , " Blair said . 0 1919740 1919926 " I don 't know if the person I 'm talking to now may end up being someone else at another time that may not follow the rules , " Parrish said . " I don 't know whether the person I 'm talking to now may end up being someone else , " Parrish said . 1 2748287 2748550 " I think it 's going to be a close vote , but I think the grant proposal is going to win , " McConnell said . " I think it 's going to be a close vote , but I think the grant proposal 's going to win , " said Sen. Mitch McConnell , assistant majority leader . 1 3394891 3394775 Twenty-eight people were believed to have been spending Christmas Day with the caretaker of the St Sophia 's camp , when the mudslide smashed into two cabins . Twenty-seven people were believed to have been spending Christmas Day with the caretaker of Saint Sophia Camp , a Greek Orthodox facility , when the mudslide roared through . 0 2963943 2963880 One , Capt. Doug McDonald , remained hospitalized in critical condition on Thursday . Her 20-year-old sister , Allyson , was severely burned and remained hospitalized in critical condition . 0 1865364 1865251 The United States finally relented during President Bush 's visit to Africa earlier this month . During President Bush 's trip to Africa earlier this month , however , Washington said it would support the increase . 1 263690 263819 " There is no conscious policy of the United States , I can assure you of this , to move the dollar at all , " he said . He also said there is no conscious policy by the United States to move the value of the dollar . 1 283751 283290 It 's the first such drill since the September 11 terrorist attacks on New York and Washington . It is the nation 's first large-scale counterterrorism exercise since the Sept . 11 terrorist attacks . 1 2517014 2516995 Myanmar 's pro-democracy leader Aung San Suu Kyi will return home late Friday but will remain in detention after recovering from surgery at a Yangon hospital , her personal physician said . Myanmar 's pro-democracy leader Aung San Suu Kyi will be kept under house arrest following her release from a hospital where she underwent surgery , her personal physician said Friday . 1 1330643 1330622 According to the Merchant Marine Ministry , the 37-year-old ship is registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands . The Baltic Sky is a 37-year-old ship registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands . 1 3111452 3111428 In an unusual move , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages that critics contend could disrupt millions of Web sites . In an unusual move that critics contend could disrupt millions of Web sites , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages . 0 1167835 1167651 Kansas Department of Health and Environment records show there were 88 abortions performed on girls age 14 and younger last year . Statistics from the Kansas Department of Health and Environment show that 11,844 abortions were performed in the state last year . 0 1423836 1423708 A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter . 
Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter . 1 2090911 2091154 Waiting crowds filling the streets on both sides overwhelmed the peacekeepers soon after daylight , sweeping past the barbed wire barricades . But waiting crowds filling the streets rushed the bridges soon after daylight , overrunning razor-wire barricades . 1 2265271 2265152 Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products not sold in the United States . Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products unknown to the American market . 1 3062202 3062308 By skirting the FDA 's oversight , Eagan said , the quality of the imported drugs is " less predictable " than for those obtained in the United States . By skirting the FDA 's oversight , Eagan said the quality of the imported drugs is " less predictable " than U.S. drugs . 1 2155514 2155377 He said : " For the first time there is an easy and affordable way of making this treasure trove of BBC content available to all . " " For the first time , there is an easy and affordable way of making this treasure trove of BBC content available to all , " Dyke said . 1 1552068 1551928 Three such vigilante-style attacks forced the hacker organizer , who identified himself only as " Eleonora [ 67 ] , " to extend the contest until 7 p.m. EST Sunday . Three such vigilante-style attacks forced the hacker organiser , who identified himself only as " Eleonora67 ] , " to extend the contest until 8am ( AEST ) today . 1 936978 937500 Eric Gagne pitched a perfect ninth for his 23rd save in as many opportunities . Gagne struck out two in a perfect ninth inning for his 23rd save . 0 985015 984975 One way or another , Harry Potter And The Order Of The Phoenix will be in your hands by Saturday . Just about everything about " Harry Potter and the Order of the Phoenix " will set records . 1 1430357 1430425 " Allison just proves you don 't need to wait until August or September to have a disaster , " said Josh Lichter , a meteorologist with the Houston-Galveston weather office . " Allison just proves you don 't need to wait until August or September to have a disaster , " Lichter said . 1 3039310 3039413 Today , analysts say , UN members can no longer ignore the shifts since the September 11 2001 attacks . On Wednesday , analysts say , UN members can no longer ignore the shifts since the attacks in the US of September 11 2001 . 1 34513 34742 Police say CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the United States . Mr McKinlay said that CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the US . 1 368067 368018 Chiron already has nearly 20 percent acceptances from PowderJect 's shareholders . Chiron has acceptances from holders of nearly 20 percent of PowderJect shares . 0 611663 611716 Ernst & Young has denied any wrongdoing and plans to fight the allegations . Ernst & Young has denied the SEC 's claims , and called its recommendations " irresponsible " . 1 98432 98657 The attack followed several days of disturbances in the city where American soldiers exchanged fire with an unknown number of attackers as civilians carried out demonstrations against the American presence . 
The attack came after several days of disturbance in the city in which U.S. soldiers exchanged fire with an unknown number of attackers as civilians protested the American presence . 1 3039007 3038845 No company employee has received an individual target letter at this time . She said no company official had received " an individual target letter at this time . " 1 1708040 1708062 Second-quarter results reflected a gain of 10 cents per diluted share , while the 2002 results included a loss of 19 cents per diluted share . The second-quarter results had a non-operating gain of 10 cents a share while the 2002 second-quarter performance had a net non-operating loss of 19 cents a share . 0 1757264 1757375 He allegedly told his ex-wife in an angry phone call that he had no intention of following their new custody agreement . The two had battled over custody and he allegedly told her in an angry phone call that he had no intention of following their new custody agreement . 1 383417 383558 Worldwide , more than 50 million people have seen " Les Miz , " with gross receipts of $ 1.8 billion . Worldwide , Les Misérables has been seen by over 50 million people , with a total gross of over $ 2 billion . 0 2766112 2766084 In fiction : Edward P. Jones ( " The Known World " ) and Scott Spencer ( " A Ship Made of Paper " ) . The fifth nominee for fiction is Scott Spencer , for A Ship Made of Paper . 1 1261116 1261234 " Overwhelmingly the Windows brand really resonated with them . " " Windows was the part of the experience that really resonated with people . " 1 3028143 3028234 The Centers for Medicare and Medicaid Services , the federal agency that runs Medicare , last year began a similar effort for nursing homes . The Centers for Medicare and Medicaid launched a similar consumer tool for nursing homes last year . 0 249699 249623 Vivace was founded in 1999 and has raised over $ 118 million in three rounds of venture financing . During difficult times for technology venture capital , Vivace raised over $ 118 million in three rounds of venture financing . 0 3448488 3448449 The Dow Jones industrial average < .DJI > added 28 points , or 0.27 percent , at 10,557 , hitting its highest level in 21 months . The Dow Jones industrial average < .DJI > rose 49 points , or 0.47 percent , to 10,578 . 1 2749322 2749663 The Democratic candidates also began announcing their fund-raising totals before Wednesday 's deadline to file quarterly reports with the Federal Election Commission . The Democratic candidates also began announcing their fund-raising totals in advance of the deadline today to file quarterly reports with the Federal Election Commission . 0 2204592 2204588 Sun Microsystems Inc. on Thursday said it had added 100 new third-party systems and 100 new components to its Hardware Compatibility List for the Solaris x86 operating system Platform Edition . The vendor has added 100 new third-party systems and 100 new components to the operating system 's Hardware Compatibility List ( HCL ) . 1 2889005 2888954 Prosecutors said PW Marketing violated the state 's 1998 anti-spam law by sending unsolicited e-mail without a toll-free number for recipients to call to stop additional mailings . Prosecutors said PW Marketing violated the 1998 anti-spam law because these unsolicited e-mails were sent without a free call number for recipients to phone to stop additional mailings . 0 1657632 1657619 The Neighbours star and singer spent yesterday resting at her family home in Sydney and will have more tests today . 
Goodrem spent yesterday resting in her family home in Sydney and will have more tests today to determine her exact treatment . 0 555617 555528 The 3 rd Armored Cavalry Regiment is 5,200 strong and the largest combat unit at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment . 1 2396937 2396818 " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the Fed said in a statement accompanying the unanimous decision . " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the policy-setting Federal Open Market Committee said . 0 2339738 2339771 " It is bad for Symbian , " said Per Lindberg , analyst at Dresdner Kleinwort Wasserstein . " Motorola has displayed clear disloyalty " to Symbian , said Per Lindberg , an analyst at Dresdner Kleinwort Wasserstein in London . 0 1616174 1616206 Bob Richter , a spokesman for House Speaker Tom Craddick , had no comment about the ruling . Bob Richter , spokesman for Craddick , R-Midland , said the speaker had not seen the ruling and could not comment . 1 635783 635802 But Ms Ward said the headroom under its financial covenants was " tight " and that there could be another downgrade if Southcorp breached any of its banking covenants . But Ms Ward said the headroom under its financial covenants was " tight " and that there could be a rating downgrade if Southcorp did breach any banking covenants . 1 3444633 3444733 He added : ``I 've never heard of more reprehensiblebehaviour by a doctor . The Harrisons ’ lawyer Paul LiCalsi said : “ I ’ ve never heard of more reprehensible behaviour by a doctor . 1 555553 555528 Broomhead was assigned to 2nd Squadron , 3rd Armor Cavalry Regiment , based at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment . 1 1112021 1111925 Other staff members , however , defended the document , saying it would still help policy-makers and the agency improve efforts to address the climate issue . Some E.P.A. staff members defended the document , saying that although pared down it would still help policy makers and the agency address the climate issue . 0 2749410 2749625 President Bush raised a record-breaking $ 49.5 million for his re-election campaign over the last three months , with contributions from 262,000 Americans , the president 's campaign chairman said Tuesday . President Bush has raised $ 83.9 million since beginning his re-election campaign in May , and has $ 70 million of that left to spend , his campaign said Tuesday . 1 1629064 1629043 An episode is declared when the ozone reaches .20 parts per million parts of air for one hour . A Stage 1 episode is declared when ozone levels reach 0.20 parts per million . 1 789691 789665 " He may not have been there , " the defence official said on Thursday . " He may not have been there , " said a defence official speaking on condition of anonymity . 1 844421 844679 The U.N. troops are in Congo to protect U.N. installations and personnel , and they can only fire in self defense and have been unable to stem the violence . The troops - whose mandate is to protect U.N. installations and personnel - can only fire in self-defense and have been unable to stem the violence . 1 58540 58567 North American markets grabbed early gains Monday morning , as earnings season begins to slow and economic indicators take the spotlight . 
North American futures pointed to a strong start to the first trading session of the week Monday , as earnings season slows and economic indicators take the spotlight . 1 781439 781461 Xerox itself paid a $ 10 million fine last year to settle similar SEC charges . Xerox itself previously paid a $ 10-million penalty to settle the SEC accusations . 1 1909579 1909408 " This deal makes sense for both companies , " said National Chief Executive Brian Halla . " This deal makes sense for both companies , " Halla said in a prepared statement . 0 787432 787464 The blasts killed two people and injured more than 150 others . The Atlanta Olympic Games attack killed one woman and injured more than 100 other people . 0 52758 52343 Morrill 's wife , Ellie , sobbed and hugged Bondeson 's sister-in-law during the service . At the service Morrill 's widow , Ellie , sobbed and hugged Bondeson 's sister-in-law as people consoled her . 1 1675025 1675047 Spansion products are to be available from both AMD and Fujitsu , AMD said . Spansion Flash memory solutions are available worldwide from AMD and Fujitsu . 1 2131318 2131372 About 1,500 police will be deployed for the visit . Around 1,500 police are to be deployed at Niigata for the ferry 's visit . 1 325763 325928 Gamarekian told The News she remembers only the woman 's first name - and refused to reveal it . She told the New York Daily News she remembers only the intern 's first name , which she refused to reveal . 1 2638975 2638855 One of the FBI ’ s key operatives , who had a falling out with the bureau , provided an account of the operation at a friend ’ s closed immigration court proceeding . One of the FBI 's key operatives , who has had a falling-out with the bureau , provided an account of the operation at a friend 's closed immigration court proceeding . 1 2198694 2198937 A nationally board certified teacher with a master 's degree , Kelley makes a salary of $ 65,000 in his 30th year . A nationally board certified teacher with a master 's degree , Kelley , in his 30th year teaching , makes $ 65,000 . 1 1825432 1825301 A man arrested for allegedly threatening to shoot and kill a city councilman from Queens was ordered held on $ 100,000 bail during an early morning court appearance Saturday . The Queens man arrested for allegedly threatening to shoot City Councilman Hiram Monserrate was held on $ 100,000 bail Saturday , a spokesman for the Queens district attorney said . 1 2906104 2906322 They were being held Sunday in the Camden County Jail on $ 100,000 bail . They remained in Camden County Jail on Sunday on $ 100,000 bail . 1 722278 722383 Ms Stewart , the chief executive , was not expected to attend . Ms Stewart , 61 , its chief executive officer and chairwoman , did not attend . 0 101747 101777 Christina 's aunt , Shelley Riling , said the defense 's claims were preposterous . Christina 's aunt , Shelley Riling , said she will address the court . 1 2224884 2224819 The Justice Department Aug. 19 gave pre-clearance for the Oct. 7 date for the election to recall Gov. Gray Davis , saying it would not affect minority voting rights . The Justice Department on Aug. 19 sanctioned the Oct. 7 date for recall election , saying it would not affect voting rights . 0 977938 978162 Lord Falconer hailed the changes as " a new beginning as far as the courts , Crown Prosecution Service and police are concerned " . " It 's a new beginning as far as the courts , Crown Prosecution Service and police are concerned , making the criminal justice system work better . 
" 0 1015010 1014963 GE stock closed at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange . 1 1513190 1513246 At least 27 US troops have been killed in hostile fire since Bush 's statement . At least 26 American troops have been killed in hostile fire since major combat was officially declared over on May 1 . 1 2385348 2385394 A recent poll showed Edwards with a narrow lead in South Carolina , and he plans a rally there later on Tuesday . A recent poll showed Edwards in a virtual four-way tie at the top in South Carolina , and he plans a rally there later on Tuesday . 1 2317018 2317252 November 17 's last victim was British defence attache Stephen Saunders , who was shot on an Athens road in June 2000 . November 17 's last victim was British defense attache Stephen Saunders , who was shot and killed at point-blank range on a busy Athens road in June 2000 . 0 1831696 1831660 The agency charged that one WD Energy worker discussed false reporting with traders at two other energy companies . The agency found further that a WD Energy employee discussed false reporting with traders at two other energy companies , which the CFTC didn 't identify . 1 1528383 1528083 Zulifquar Ali , a worshipper slightly wounded by shrapnel , said the assailants first targeted the mosque 's security guards . Witness Zulfiqar Ali , who was slightly wounded by shrapnel , said the attackers had focused on the mosque 's guards . 1 917965 918315 For the second year in a row , rises in hospital costs accounted for much of the inflation , accounting for 51 percent of the overall cost increase . For the second year in a row , rises in hospital costs dominated the increase , accounting for 51 percent of the overall cost spiral . 0 3218713 3218830 Q : Can I buy coverage for prescription drugs right away ? Congress has added a new benefit - an option to buy insurance coverage for prescription drugs . 1 221079 221003 The airline also said it has the option to buy 380 more airplanes , orders that would be split evenly between the two manufacturers . The airline has the option to buy 380 more , split evenly between the two manufacturers . 1 2546175 2546198 Dr Mark McClean , Jonathan 's family doctor , said if the drug had been administered earlier Jonathan would have retained more of his brain functions . Dr Mark McClean , the family 's GP , said had the drug been administered to Jonathan earlier , he would have retained more of his brain function . 0 799346 799268 The chain operates more than 3,400 stores , and has annual revenue of about $ 15.8 billion . The chain , which has been under new management since late 1999 , has more than 3,400 stores and $ 15.8 billion in annual revenue . 0 2673104 2673130 All patients developed some or all of the symptoms of E. coli food poisoning : bloody diarrhea , vomiting , abdominal cramping and nausea . Symptoms of the E. coli infection include bloody diarrhea , nausea , vomiting and abdominal cramping . 1 1354501 1354476 Federal regulators have turned from sour to sweet on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings Inc. and Dreyer 's Grand Ice Cream Inc . Federal regulators have changed their minds on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings and Dreyer 's Grand Ice Cream . 1 3070979 3070949 Environmental campaigners are using this weekend ’ s lunar eclipse to highlight the huge increase in light pollution across the UK . 
Environmental campaigners used the eclipse to highlight the surge in light pollution across Britain . 0 1264509 1264471 Available July 7 , the software supports the Solaris , IBM AIX , Red Hat Linux and Windows operating systems . The OpForce product currently works with Solaris , AIX , Red Hat Linux and Windows servers . 1 103280 103431 Justice Minister Martin Cauchon and Prime Minister Jean Chrétien have both said the Liberal government will introduce legislation soon to decriminalize possession of small amounts of pot for personal use . Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said the government will introduce legislation to decriminalize possession of small amounts of pot . 0 110731 110648 But Chauncey Billups demonstrated he 's also capable of big games , scoring 77 points over the final two games against the Magic . Billups scored 77 points in the final two games of the first-round series against the Magic . 1 2274844 2274714 Kelly killed himself after being exposed as the source for a BBC report which claimed the government had embellished evidence of Iraq 's banned weapons to justify the war . He killed himself after being exposed as the source for a BBC report which claimed the government exaggerated the case for war against Iraq . 0 1050307 1050144 And it 's going to be a wild ride , " said Allan Hoffenblum , a Republican consultant . Now the rest is just mechanical , " said Allan Hoffenblum , a Republican consultant . 1 2810634 2810670 While the Ibrahims had one separation operation , Goodrich and Dr. David Staffenberg plan about three for the Aguirres , with several weeks between each . Instead of one long operation to separate the twins , Goodrich and Dr. David Staffenberg plan about three , with several weeks between each . 1 3073773 3073779 Lay had contended that turning over the documents would violate his Fifth Amendment right against self-incrimination . Lay had refused to turn over the papers , asserting his Fifth Amendment right against self-incrimination . 0 261202 260995 The WHO experts didn 't say how many cases in Hebei were in rural areas . Hebei has reported 191 cases and eight deaths , though the WHO experts did not say how many were in rural areas . 1 1824224 1824209 Nearly 300 mutinous troops who seized a Manila shopping and apartment complex demanding the government resign gave up and retreated peacefully after some 19 hours . Mutinous troops who seized a Manila shopping and apartment complex demanding the government resign ended a 19-hour standoff late Sunday and returned to barracks without a shot fired . 1 548867 548785 In three years , Lend Lease has slipped from a top-five stock , when its share price was around $ 24 , to 37th . In the space of three years , Lend Lease has slipped from a top-five 5 stock when its share price hovered around $ 24 to 37th on the list . 0 2796658 2796682 About two hours later , his body , wrapped in a blanket , was found dumped a few blocks away . Then his body was dumped a few blocks away , found in a driveway on Argyle Road . 1 1808166 1808434 Columbia broke up over Texas upon re-entry on Feb. 1 . Columbia broke apart in the skies above Texas on Feb. 1 . 1 853475 853342 A year or two later , 259 , or 10 per cent , of the youths reported that they had started to smoke , or had taken just a few puffs . Within two years , 259 , or 10 percent , of the youths reported they had started to smoke or had at least taken a few puffs . 
0 977772 977804 The Lord Chancellor was guardian of the Great Seal , used to stamp all official documents from the sovereign . Falconer will hold on , for now , to the Lord Chancellor 's Great Seal , used to sign off instructions from the sovereign . 1 577854 578500 Cindy Yeast , a 50-year-old Washington-area publicist , says she began taking supplements two years ago in part to avoid mild dementia that affects her elderly parents . She started taking supplements two years ago - partly to stave off mild dementia that affects her elderly parents . 1 2829194 2829229 The two are not related , but have referred to each other as father and son . He 's not related to Malvo , but the two have referred to each other as father and son . 1 2074182 2074668 Gibson said last month in a press statement that " neither I nor my film are anti-Semitic . Gibson said in a June statement that he and his film are not anti-Semitic . 0 2758265 2758282 The world 's largest software company said it recognized the difficulty the multiple patches posed for companies , and set out to make it easier for them to apply the updates . The world 's largest software company said it recognized the difficulty the multiple patches posed for companies trying to apply them . 1 1958079 1958143 The Dow Jones industrial average .DJI ended up 64.64 points , or 0.71 percent , at 9,191.09 , according to the latest available data . The blue-chip Dow Jones industrial average .DJI added 38 points , or 0.42 percent , to 9,165 . 1 544217 544325 The vote came just two days after Kurds swept City Council elections , taking the largest single block of votes on the 30-seat council . The vote for mayor followed City Council elections that gave Kurds the largest block of votes on the 30-seat council . 1 2385288 2385256 Large swells and dangerous surf already were being felt along sections of the coast . Already large swells and dangerous surf have arrived along the mid-Atlantic . 0 2324708 2325028 Based on a separate survey of households , the unemployment rate fell in August to 6.1 percent from 6.2 percent . Labor Department analysts discounted a slight improvement in the national unemployment rate , which fell in August to 6.1 percent from 6.2 percent . 1 2139506 2139427 " We will work with the board to ensure a smooth transition . " He said federal regulators would work with the corporation to ensure a " smooth transition . " 1 2965576 2965701 Gasps could be heard in the courtroom when the photo was displayed . Gasps could be heard as the photo was projected onto the screen . 1 2931098 2931144 Gilead had earnings of $ 73.1 million , or 33 cents a share , compared with $ 20.8 million , or 10 cents , in the year-ago quarter . Quarterly profit climbed to $ 73.1 million , or 33 cents a share , from $ 20.8 million , or 10 cents , a year earlier , the company said . 0 644788 644816 " I had one bad stretch of holes that put me out of contention to win , " Woods said . " I had one bad stretch of holes that put me out of contention , " Woods said , referring to his 42 on the front nine Saturday . 0 2551891 2551563 The poll had a margin of error of plus or minus 2 percentage points . It had a margin of sampling error of plus or minus four percentage points and was conducted Thursday through Saturday . 1 1089053 1089297 Sen. Patrick Leahy of Vermont , the committee 's senior Democrat , later said the problem is serious but called Hatch 's suggestion too drastic . Sen. 
Patrick Leahy , the committee 's senior Democrat , later said the problem is serious but called Hatch 's idea too drastic a remedy to be considered . 1 3435735 3435717 The broad Standard & Poor 's 500 < .SPX > eased 0.37 of a point , or 0.03 percent , at 1,121 . The Standard & Poor 's 500 Index < .SPX > slipped 0.26 point , or 0.02 percent , to 1,121.96 . 0 1954 2142 Watertown , Saugus and Framingham also are going smoke-free Monday , joining a growing number of cities around the country . Along with Boston , Watertown , Saugus and Framingham also are going smoke-free Monday . 1 3400796 3400822 That is evident from their failure , three times in a row , to get a big enough turnout to elect a president . Three times in a row , they failed to get a big _ enough turnout to elect a president . 1 1220668 1220801 We firmly believe we have an absolute right to use the common word ' spike ' as the name of our network . " We firmly believe that we have an absolute right to use the common word ' spike ' to name our network . 1 1889954 1889847 Sources who knew of the bidding said last week that cable TV company Comcast Corp. was also looking at VUE . Late last week , sources told Reuters cable TV company Comcast Corp. CMCSA.O also was looking at buying VUE assets . 1 315785 315653 But MTA officials appropriated the money to the 2003 and 2004 budgets without notifying riders or even the MTA board members considering the 50-cent hike , Hevesi found . MTA officials appropriated the surplus money to later years ' budgets without notifying riders or the MTA board members when the 50-cent hike was being considered , he said . 0 1521034 1520582 White , who had suffered kidney failure from years of high blood pressure , died at Cedars-Sinai Medical Center around 9 : 30 a.m. , said manager Ned Shankman . White , who had kidney failure from years of high blood pressure , had been undergoing dialysis and had been hospitalized since a September stroke . 1 2083598 2083810 About 10 percent of high school and 16 percent of elementary students must be proficient at math . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient . 1 1910610 1910455 The legal ruling follows three days of intense speculation Hewlett-Packard Co. may be bidding for the company . The legal ruling follows three days of wild volatility in RIM 's stock over speculation that PC giant Hewlett-Packard Co. may be bidding for the company . 1 3113791 3113782 The European Commission , the EU 's antitrust enforcer , is expected to issue its decision next spring — unless a settlement is reached . The European Commission is expected to issue its decision in the case next spring — unless a settlement is reached . 1 3214517 3214483 " So Sebastian did his best to convincingly confess to a crime that he didn 't commit in order to survive , " she told jurors . " Sebastian did his best to confess convincingly to a crime he didn 't do in order to survive , " Ms. Richardson declared . 0 2083612 2083810 Twenty percent of Latino students and 23 percent of black students performed at proficient or higher . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient . 1 661390 661218 He is charged in three bombings in Atlanta including a blast at the 1996 Olympics and one in Alabama . He is charged in three bombings in Atlanta - including a blast at the 1996 Olympics - along with the bombing in Alabama . 
1 1269572 1269682 The men were remanded in custody and are due to appear again before court on July 8 . They were remanded in custody and will appear in court again on July 8 . 1 1095780 1095652 " No matter who becomes the sponsor for stock-car racing 's top series , NASCAR will need an all-star event , " Wheeler said in a statement . No matter who becomes the sponsor for stock-car racings top series , NASCAR will need an all-star event , Wheeler said Tuesday . 1 116294 116332 The Phillies were upset that Counsell had stolen second in the sixth inning with Arizona leading 7-1 . The Phillies were apparently upset when Counsell stole during the sixth with the Diamondbacks up 7-1 . 1 941617 941673 He said his hatred for such people grew from these discussions and had helped convince him violence was the answer . His hatred for these people had germinated from these discussions and helped cement his belief that violence was the panacea . 1 2640607 2640576 " There is no need for one deadline for all to create the ASEAN Economic Community , " Thaksin said . Thus , he said , there did not have to one deadline to create the economic community . 1 3310210 3310286 The announcement was made during the recording of a Christmas concert attended by top Vatican cardinals , bishops , and many elite from Italian society , witnesses said . The broadside came during the recording on Saturday night of a Christmas concert attended by top Vatican cardinals , bishops and many elite of Italian society , witnesses said . 1 3376093 3376101 The additional contribution brings total U.S. food aid to North Korea this year to 100,000 tonnes . The donation of 60,000 tons brings the total of U.S. contributions for the year to 100,000 . 1 1549586 1549609 Leon Williams ' body was found inside his third-floor apartment at 196 Bay St. , in Tompkinsville . The dead man , Leon Williams , was found in his third-floor apartment . 1 460211 460445 The player 's eyes were bloodshot and a blood-alcohol test produced a reading of 0.18 - well above Tennessee 's level of presumed intoxication of 0.10 , the report said . He failed a field sobriety test and a blood-alcohol test produced a reading of 0.18 – well above Tennessee 's level of presumed intoxication of 0.10 , the report said . 1 1196962 1197061 But Virgin wants to operate Concorde on routes to New York , Barbados and Dubai . Branson said that his preference would be to operate a fully commercial service on routes to New York , Barbados and Dubai . 0 862804 862715 He tried to fight off officers and was taken to a hospital after a police dog bit him but was later released . Cruz tried to fight off officers and was hospitalized after a police dog bit him , Sgt. Steve Dixon said . 1 1726935 1726879 The announcement , which economists said was not a surprise , may be bittersweet for the millions of Americans without jobs . Economists said the announcement was not a surprise , and politicians said it offered little comfort to the millions of Americans without jobs . 0 331980 332110 Asked if the delegates could leave on Friday , police intelligence chief in Aceh , Surya Dharma , told reporters they could not because they did not have proper permission . Asked if the delegates could leave on Friday , police intelligence chief Surya Dharma told reporters : " Of course they may not go . 1 173879 173832 Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid the yen 's rise against the dollar . 
Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid ever-falling domestic interest rates . 0 2834988 2835026 Iran has until the end of the month to satisfy the agency it has no plans for nuclear weapons . The Iranians have until the end of the month to answer all the agency 's questions about their past nuclear activities . 1 2587300 2587243 Her father , Florin Cioaba , the king of Transylvania 's Gypsies , had her brought back and she was married against her will . Her father , Roma King Florin Cioaba , had her brought back and she was promptly married against her will . 0 554905 554627 Claire had advanced to the third round of the 76th annual Scripps Howard National Spelling Bee . One by one they strolled to the microphone , all 251 youngsters in the 76th Scripps Howard National Spelling Bee . 1 1912524 1912648 Citigroup Inc . C.N , the world 's largest financial services company , on Wednesday promoted Marjorie Magner to chairman and chief executive of its global consumer group . Citigroup ( C ) on Wednesday named Marjorie Magner chairman and chief executive of its colossal global consumer business . 1 3255597 3255668 " They 've been in the stores for over six weeks , " says Carney . The quarterlies usually stay in stores for between six to eight weeks , " Carney added . 1 629316 629289 Let me just say this : the evidence that we have of weapons of mass destruction was evidence drawn up and accepted by the joint intelligence community . " The evidence that we had of weapons of mass destruction was drawn up and accepted by the Joint Intelligence Committee , " he said . 1 54181 53570 Ridge said no actual explosives or other harmful substances will be used . Ridge said no real explosives or harmful devices will be used in the exercise . 1 723557 724115 Thus far , Stewart 's company appears ready to stand behind her . For now , the company 's management appears to be standing behind Stewart . 0 2607718 2607708 But late Thursday night , the campaign issued a statement saying there would be no news conference and no big announcement . But late yesterday , the campaign and the state Democratic Party said there would be no news conference . 1 753858 753890 There 's also a flaw that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box . 1 587009 586969 Another $ 100-million in savings will come from management layoffs and pay cuts . The airline expects to save another $ 100-million a year through management layoffs and pay cuts . 1 308567 308525 He called on Prime Minister John Howard to establish a royal commission on child sex abuse . The Senate motion also called on Prime Minister John Howard to hold a royal commission into child sex abuse . 0 665419 665612 " We think that the United States of America should support the free speech of all groups , " Mr. White said , objecting to Mr. Olson 's recommendation . We think that the United States of America should support the free speech of all groups , he said . 1 2763517 2763576 Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler . The tube was removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for several years . 
0 3107118 3107136 After 18 months , Nissen found that Lipitor stopped plaque buildup in the patients ' arteries . After 18 months , the atorvastatin patients had no change in the plaque in their arteries . 1 780604 780466 Toll , Australia 's second-largest transport company , last week offered NZ75 a share for Tranz Rail . Toll last week offered to buy the company for NZ75c a share , or $ NZ158 million . 0 1989213 1989116 " This child was literally neglected to death , " Armstrong County District Attorney Scott Andreassi said . Armstrong County District Attorney Scott Andreassi said the many family photos in the home did not include Kristen . 1 1462409 1462504 Wal-Mart , the nation 's largest private employer , has expanded its antidiscrimination policy to protect gay and lesbian employees , company officials said Tuesday . Wal-Mart Stores Inc . , the nation 's largest private employer , will now include gays and lesbians in its anti-discrimination policy , company officials said Wednesday . 1 260952 260924 Metro , bus and local rail services in France 's four largest towns -- Paris , Lyon , Lille and Marseille -- were severely disrupted , Europe 1 radio reported . Subway , bus and suburban rail services in France 's four largest cities -- Paris , Lyon , Lille and Marseille -- were severely disrupted , transport authorities said . 1 1224743 1225510 In the undergraduate case , Rehnquist said the use of race was not " narrowly tailored " to achieve the university 's asserted interest in diversity . Rehnquist wrote that the system was not narrowly tailored to achieve the interest in educational diversity . 0 3329379 3329416 SP2 is basically about security enhancements to Windows , such as the improved Internet Connection Firewall ( ICF ) . The firewall in the current Windows XP was known as the Internet Connection Firewall ( ICF ) . 1 2362761 2362698 A landslide in central Chungchong province derailed a Seoul-bound train and 28 passengers were injured , television said . In central Chungchong province , a landslide caused a Seoul-bound Saemaeul Express train to derail , injuring 28 people , local television said . 0 1465073 1464854 They will help draft a plan to attack obesity that Kraft will implement over three to four years . The team will help draft a plan by the end of the year to attack obesity . 1 195728 196099 But that amount would probably be impossible to pass in the Senate , where Republican moderates have refused to go above $ 350 billion . Such an amount would probably be unable to summon a majority of the Senate , where Republican moderates have refused to go above $ 350 billion . 1 2587767 2587673 In the clash with police , Lt. Mothana Ali said about 1,000 demonstrators had gone to the station demanding jobs . In Baghdad , police Lieut . Mothana Ali said about 1,000 demonstrators arrived at the station demanding jobs . 0 1490044 1489975 Corixa shares rose 54 cents to $ 7.74 yesterday on the Nasdaq Stock Market . Shares of Corixa rose 54 cents , or about 8 percent , to close at $ 7.74 . 1 958161 957782 Committee approval , expected today , would set the stage for debate on the Senate floor beginning Monday . That would clear the way for debate in the full Senate beginning on Monday . 1 1033204 1033365 O 'Brien was charged with leaving the scene of a fatal accident , a felony . Bishop Thomas O 'Brien , 67 , was booked on a charge of leaving the scene of a fatal accident . 
0 2996241 2996734 Tom Hamilton said his daughter was conscious and alert and in stable condition after the attack Friday morning . Bethany , who remained in stable condition after the attack Friday morning , talked of the attack Saturday . 0 2015389 2015410 The Calgary woman , who is in her twenties , donated blood on Aug. 7 . The woman -- who has no symptoms of illness -- donated blood Aug. 7 . 1 221515 221509 Quattrone lawyer John W. Keker said his client is innocent . In a statement Monday , his lawyer John Keker said ``Frank Quattrone is innocent . 0 2283737 2283794 In the weeks leading up to the execution , several Florida officials received anonymous threatening letters . Several Florida officials connected to the case have received threatening letters , accompanied by rifle bullets . 1 2826681 2826474 The disagreement over online music sales was disclosed in documents filed last week with the judge and made available by the court yesterday . The fight over online music sales was disclosed in documents made available Monday by the court . 1 2249237 2249305 Parson was charged with intentionally causing and attempting to cause damage to protected computers . Parson is charged with one count of intentionally causing damage to a protected computer . 1 389239 389299 " The court and the public need to know much more of the details of the defendant 's seemingly massive fraud , " the judge said . " The court and the public need to know more of the defendants ' seemingly massive fraud , " he said . 1 2652187 2652218 The U.S. Supreme Court will hear arguments on Wednesday on whether companies can be sued under the Americans with Disabilities Act for refusing to rehire rehabilitated drug users . The high court will hear arguments today on whether companies can be sued under the ADA for refusing to rehire rehabilitated drug users . 1 2945693 2945847 The IRS said taxpayers can avoid undelivered checks by having refunds deposited directly into their checking or savings accounts . The IRS said taxpayers can avoid problems with lost or stolen refunds by having refunds deposited directly into personal checking or savings accounts . 1 2065523 2065836 " More than 70,000 men and women from bases in Southern California were deployed in Iraq . In all , more than 70,000 troops based in Southern California were deployed to Iraq . 1 2222998 2223097 BP shares slipped 0.8 percent to 433.50 pence ( $ 6.85 ) each in afternoon trading on the London Stock Exchange . BP shares slipped 48 cents to $ 41.72 Friday in trading on the New York Stock Exchange . 1 2561999 2561941 Because of the accounting charge , the company now says it lost $ 1.04 billion , or 32 cents a share , in the quarter ended June 30 . Including the charge , the Santa Clara , Calif.-based company said Monday it lost $ 1.04 billion , or 32 cents per share , in the period ending June 30 . 0 2324704 2325023 Friday 's report raised new worries that a weak job market could shackle the budding economic recovery despite a slight improvement in the overall unemployment rate . U.S. companies slashed payrolls for a seventh straight month in August , raising new worries that a weak jobs market could shackle the budding economic recovery . 1 2336453 2336545 Federal Emergency Management Administration designated $ 20 million to establish the registry . The registry was launched with $ 20 million from the Federal Emergency Management Agency . 
1 720572 720486 BREAST cancer cases in the UK have hit an all-time high with more than 40,000 women diagnosed with the disease each year , Cancer Re-search UK revealed yesterday . Cases of breast cancer in Britain have reached a record high , with the number of women diagnosed with the disease passing the 40,000 mark for the first time . 1 1605818 1605806 " It was never our intention to sell the product , " said Health Minister Anne McClellan , a skeptic of medical marijuana use . " It was never the intention of us to sell product , " federal Health Minister Anne McLellan said yesterday in Edmonton . 0 2440680 2440474 GM , the world 's largest automaker , has 115,000 active UAW workers and another 340,000 retirees and spouses . They cover more than 300,000 UAW workers and 500,000 retirees and spouses . 0 726399 726078 Rosenthal is hereby sentenced to custody of the Federal Bureau of prisons for one day with credit for time served , " Breyer said to tumultuous cheers in the courtroom . " Rosenthal is hereby sentenced to custody of the Federal Bureau of Prisons for one day with credit for time served . " 1 533903 533818 " We are committed to helping the Iraqi people get on the path to a free society , " Rumsfeld said in a speech to the Council on Foreign Relations . " We are committed to helping the Iraqi people get on the path to a free society , " he said . 1 1166473 1166857 Mr. Young said he was disappointed that the government didn 't see the severe acute respiratory syndrome crisis as worthy of federal disaster-relief money . Young said he was disappointed the government didn 't see the SARS crisis as worthy of federal disaster relief money . 1 144089 143697 The 12-nation currency has risen by 33 percent against the dollar over the past 15 months . The euro is up 9 percent against the dollar in the past six weeks . 1 3439854 3439874 In February 2000 , the officers — Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy — were acquitted of all charges in the killing . The officers -- Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy -- were acquitted in 2000 of state murder charges . 1 3464314 3464302 I was surprised it turned out me talking and the president just listening . " I was surprised it turned out me talking and the president just listening . . . It was mostly a monologue . " 1 2008984 2009175 The state 's House delegation currently consists of 17 Democrats and 15 Republicans . Democrats hold a 17-15 edge in the state 's U.S. House delegation . 0 816867 816831 Freddie also said Leland C. Brendsel will retire as chairman and chief executive and resign from the board . He replaces Leland Brendsel , 61 , who retired as chairman and chief executive . 1 192285 192327 We 'll be listening carefully to the [ IAEA ] director general 's report at the next board meeting . " We 'll be listening carefully to the ( IAEA ) director-general 's report at the next board meeting . " 1 2688145 2688162 In that position , Elias will report to Joe Tucci , president and CEO of EMC . As executive vice president of new ventures , Elias will report to Joe Tucci , EMC 's president and chief executive . 1 3294207 3294290 But with the PM due to leave tomorrow afternoon for personal reasons there was a risk he might not be present when the final decision was made . But with the Prime Minister due to leave tomorrow , a day early , he may not be present when the final decision is made . 
0 205100 205145 A pro-independence radical , Miodrag Zivkovic , of the Liberal Alliance , came in second with 31 percent of the vote . Miodrag Zivkovic , of the Liberal Alliance of Montenegro , won 31 percent of the vote while the independent Dragan Hajdukovic got four percent . 0 3242051 3241897 Mr. Kerkorian tried unsuccessfully to take over Chrysler in 1995 , but did win representation on its board . Kerkorian and Tracinda had also tried to take over Chrysler in 1995 . 0 1076861 1077018 Glover spoke at a news conference that included about 20 relatives of the victims . About 20 family members of the victims were invited to the news conference . 1 2095803 2095786 Drax faced a financial crisis late last year after it lost its most lucrative sales contract , held with insolvent utility TXU Europe . Drax ’ s troubles began late last year when it lost its most lucrative sales contract , with the insolvent utility TXU Europe . 1 2112330 2112376 But I would rather be talking about high standards than low standards . " " I would rather be talking about positive numbers rather than negative . 1 3389318 3389271 It was not immediately known how many people were on flight UTA 141 , which could carry 141 passengers and crew . It was still not known exactly how many people were on the plane , which could carry 141 passengers and crew . 1 698948 698933 The market remains pinned in a narrow range after a powerful rally drove the broad Standard & Poor 's 500 index .SPX up more than 20 percent since mid-March . The market remains pinned in a narrow range after a powerful rally pushed the broad S & P 500 index up more than 20 percent since mid-March . 1 539585 539355 Witnesses said they believed the man planned to crash the Launceston-bound Qantas flight 1737 , which was carrying 47 passengers and six crew . Witnesses believe he wanted to crash Flight 1737 , which had 47 passengers and six crew . 1 684848 684557 As Samudra sat down to hear the indictment , he looked over to his nine lawyers and shouted ``God is Great ' ' three times . As he sat down to hear the indictment , Samudra looked over to his nine lawyers and shouted " Takbir ! " , or " Proclaim ! " , a religious rallying cry . 1 347017 347002 In hardest-hit Taipei , traffic has disappeared from once bustling streets , ubiquitous department stores stand mostly empty and restaurants are eerily quiet . In hardest-hit Taipei , traffic has disappeared from once-bustling streets and department stores and restaurants are virtually empty . 1 1592037 1592076 In a statement , Lee said he " no longer believes that Viacom deliberately intended to trade on my name when naming Spike TV . " Spike Lee no longer believes that Viacom deliberately intended to trade on his name by calling its own venture " Spike TV , " according to a statement read in court Tuesday . 0 3013483 3013540 Singapore Prime Minister Goh Chok Tong says China plays an important role in the integration of Asia , including managing the stresses and strains both within and between countries . HAINAN PROVINCE , China : Singapore Prime Minister Goh Chok Tong said China plays an important role in the integration of Asia . 1 2020252 2020081 The worm attacks Windows computers via a hole in the operating system , an issue Microsoft on July 16 had warned about . The worm attacks Windows computers via a hole in the operating system , which Microsoft warned of 16 July . 0 2614947 2614904 The premium edition adds OfficeFront Page 2003 , Acceleration Server 2000 , and SQL Server 2000 . 
The premium edition adds ISA Server , SQL Server and a specialized edition of BizTalk 2004 . 0 1744257 1744378 In the year-ago quarter , the steelmaker recorded a profit of $ 16.2 million , or 15 cents per share , on sales of $ 1.14 billion . In the second quarter last year , AK Steel reported a profit of $ 16.2 million , or 15 cents a share . 0 1119721 1119714 Sony claimed that the reader 's capacitance sensing technology cannot be fooled by paper copies and does not require cleaning . Its capacitance sensing technology electronically reads a fingerprint ; Sony says it can 't be fooled by paper copies and doesn 't require cleaning . 1 1186754 1187056 Amazon.com shipped out more than a million copies of the new book , making Saturday the largest distribution day of a single item in e-commerce history . Amazon.com shipped more than a million copies by Saturday afternoon , making Saturday the largest distribution day of a single item in e-commerce history . 1 2842562 2842582 The show 's closure affected third-quarter earnings per share by a penny . The company said this impacted earnings by a penny a share . 0 431076 431242 After the two-hour meeting on May 14 , publisher Arthur O. Sulzberger Jr . , executive editor Howell Raines and managing editor Gerald Boyd pledged quick remedies to staff grievances . The committee will make recommendations to Publisher Arthur Sulzberger , Executive Editor Howell Raines and Managing Editor Gerald Boyd . 1 1393764 1393984 It 's been a busy couple of days for security gurus assigned to keep their companies safe and sound . It 's been a busy couple of days for enterprise security gurus tasked with the job of keeping their companies safe and sound . 0 2916199 2916164 Lu reclined in a soft chair wearing a woolly coat near the blackened capsule . " It 's great to be back home , " said Lu , dressed in a woolly coat near the blackened capsule . 1 2530671 2530542 Gov. Bob Riley proposed the budget cuts after Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 . After Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 , Riley forecast significant cuts in state programs . 1 219064 218969 " It is probably not the easiest time to come in and take over the shuttle program , but then again , I look forward to the challenge , " he said . " It 's probably not the easiest time to come in and take over the shuttle program , but I look forward to the challenge , " Parsons told reporters at NASA headquarters . 0 2377289 2377259 Estonia 's place in the European mainstream and safeguard its independence regained in 1991 . Estonia was forcibly incorporated in the Soviet Union in 1940 and regained its independence only in 1991 . 0 2110220 2110199 Franklin County Judge-Executive Teresa Barton said a firefighter was struck by lightning and was taken to the Frankfort Regional Medical Center . A county firefighter , was struck by lightning and was in stable condition at Frankfort Regional Medical Center . 0 1864253 1863810 Police suspected that Shaichat , 20 , had been abducted either by Palestinians or by Israeli Arabs . Nobody claimed responsibility for Schaichat 's death , but police suspect that the 20-year-old soldier was abducted either by Palestinians or Israeli Arabs . 0 3150803 3150839 During this year 's August to October quarter , Lowe 's opened 38 new stores , including two relocations . During the third quarter , Lowe 's opened 38 new stores and now has 932 stores in 45 states . 
0 969381 969512 The technology-laced Nasdaq Composite Index < .IXIC > declined 25.78 points , or 1.56 percent , to 1,627.84 . The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 . 1 271891 271839 Sony said the PSP would also feature a 4.5-inch LCD screen , Memory Stick expansion slots . It also features a 4.5 in back-lit LCD screen and memory expansion facilities . 0 2829648 2829613 Clinton did not mention that two Democratic senators , Charles Robb of Virginia and Wendell Ford of Kentucky , voted to shelve the McCain bill . Two Democrats , Sen. Charles Robb of Virginia and Wendell Ford of Kentucky , voted with the 40 Republicans . 1 886904 887158 Some of the company 's software developers will join Microsoft , but details haven 't been finalized , said Mike Nash , corporate vice president of Microsoft 's security business unit . Some of the companys software developers will join Microsoft , but details havent been finalized , said Mike Nash , corporate vice president of Microsofts security business unit . 0 2632692 2632767 Wal-Mart has said it plans to open at least 40 Supercenters in the state in the coming years ; analysts expect four or more to be in San Diego County . At least 40 of the outlets will be in California , and analysts expect four or more to be in San Diego County . 1 2240399 2240149 Cintas is battling efforts to unionize 17,000 of its workers and to let unions organize the workers by signing cards , rather than by a lengthy election process . Cintas is battling efforts to unionize 17,000 of its workers and labor 's demands to let its workers organize by signing cards , rather than by a lengthy election process . 1 805457 805985 The opposition would resort to rolling mass action " at strategic times of our choice and without warning to the dictatorship , " he said . " From now onwards we will embark on rolling mass action at strategic times of our choice and without any warning to the dictatorship , " he said . 1 2896308 2896334 Federal Agriculture Minister Warren Truss said the Government still did not know the real reason the sheep were rejected at the Saudi port of Jeddah on August 21 . He said the Government still did not know the real reason the original Saudi buyer pulled out on August 21 . 1 2110775 2110924 Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said that scenario is one among many that investigators are considering . Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said investigators are considering the scenario . 1 1762569 1762526 Hester said Sanmina was the best fit among several purchase offers the company received from electronics manufacturers and computer makers . Hester said Sanmina 's offer was the best among several Newisys received from electronics manufacturers and computer makers . 0 2706154 2706185 The other inmate fell but Selenski shimmed down the makeshift rope to a second-story roof and used the mattress to scale a razor-wire fence , Fischi said . After the other inmate fell , Selenski used the mattress to scale a 10-foot , razor-wire fence , Fischi said . 1 1057995 1057778 The hearing , expected to last a week , will determine whether Akbar faces a court-martial . The purpose of the hearing is to determine whether Akbar should be court-martialled . 
1 1386884 1386857 He said he has begun a court action to seize Beacon Hill 's assets and has frozen more than $ 13 million Beacon Hill had when it closed . He said he has initiated a forfeiture action in court and frozen more than $ 13 million Beacon Hill had when it closed . 1 3093023 3092996 Speaking for the first time yesterday , Brigitte 's maternal aunt said his family was unaware he had was in prison or that he had remarried . Brigitte 's maternal aunt said his family was unaware he had been sent to prison , or that he had remarried in Sydney . 1 1661381 1661317 " Close co-operation between our law enforcement agencies , close co-operation between our intelligence services lie at the heart of the ongoing fight against terrorism . " Close cooperation between regional law enforcement agencies and intelligence services was at the heart of the fight against terrorism , he said . 0 2926039 2925982 The mother of a Briton held by Colombian guerrillasspoke of her relief yesterday after hearing that he might be freed in the next few weeks . The parents of a Briton being held hostage by Colombian rebels spoke yesterday of their optimism that he would be freed in time for his birthday next month . 0 637168 637447 We strongly disagree with Novell 's position and view it as a desperate measure to curry favor with the Linux community . McBride characterized Novell 's move as " a desperate measure to curry favor with the Linux community . " 1 696677 696932 After more than two years ' detention under the State Security Bureau , the four were found guilty of subversion in Beijing 's No. 1 Intermediate Court last Wednesday . After more than two years in detention by the State Security Bureau , the four were found guilty last Wednesday of subversion . 1 3122429 3122305 Mr Russell , 46 , a coal miner from Brisbane , said : " They are obviously hurting , so we are basically going over there to help them . " " They are obviously hurting so we are basically going over there to help them , " Russell , 46 , said . 1 1348909 1348954 The New York Democrat and former first lady has said she will not run for the White House in 2004 , but has not ruled out a race in later years . The former first lady has said she will not run for the White House in 2004 but has not ruled out a race later on . 0 162203 162101 It does not affect the current Windows Media Player 9.0 Series . Windows Media Player has had security problems before . 0 71501 71627 The seizure took place at 4 a.m. on March 18 , just hours before the first American air assault . The time was about 4 a.m. on March 18 , just hours before the first pinpoint missiles rained down on the capital . 1 2907762 2907649 Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations and large branches of the United Way by 15 percent and 28.6 percent , respectively . Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations by 15 percent and to large branches of the United Way by 28.6 percent . 1 2167771 2167744 In May , Mr. Hatfill said he was struck by a vehicle being driven by an FBI employee who was tailing him in Georgetown . Last May , Hatfill was struck by a vehicle being driven by an FBI employee who was tailing him in Washington 's Georgetown neighborhood . 1 3320577 3320553 " I will support a constitutional amendment which would honor marriage between a man and a woman , codify that , " he said . 
" If necessary , I will support a constitutional amendment which would honour marriage between a man and a woman , codify that . " 1 849291 849442 IBM of the US and Infineon Technologies of Germany will today announce a technological development that could threaten multi-billion dollar memory chip markets . IBMof the US andInfineon Technologies of Germany willon Tuesdayannounce a technological development that could threaten multi-billion dollar memory chip markets . 0 763948 763991 Costa 's semifinal opponent is Spaniard Juan Carlos Ferrero , whom he beat in last year 's final . Costa will play Juan Carlos Ferrero next in a rematch of last year 's final . 1 1908763 1908744 A former employee of a local power company pleaded guilty Wednesday to setting off a bomb that knocked out a power substation during the Winter Olympics last year . A former Utah Power meter reader pleaded guilty Wednesday to bombing a power substation during the 2002 Winter Olympics . 0 1876120 1876059 Thyroid hormones are known to help in weight loss by stimulating metabolism - and cutting cholesterol - but come with the unwanted side effect of speeding up the heartbeat . Thyroid hormones are known to help in weight loss by stimulating metabolism , and they can help cut cholesterol too . 1 518089 518133 Judge Craig Doran said it wasn 't his role to determine if Hovan was " an evil man " but maintained that " he has committed an evil act . " Judge Craig Doran said he couldn 't determine if Hovan was " an evil man " but said he " has committed an evil act . " 0 224932 224868 The Hartford shares rose $ 2.88 , or 6.6 percent , to close Monday at $ 46.50 on the New York Stock Exchange . Shares of Hartford rose $ 2.88 to $ 46.50 in New York Stock Exchange composite trading . 1 1771131 1771091 It also offers a built-in NAND flash boot loader so that high-density NAND flash memory can be used without having to install an additional support chip . The S3C2440 has a built-in NAND flash boot loader , for example , so that high-density NAND flash memory can be installed without an additional support chip . 0 2728425 2728251 It decided instead to issue them before the stock market opened Monday after the downgrade of its debt late Friday by Moody 's , the credit rating agency . It decided instead to issue them before the stock market opened Monday to counteract the downgrade of its debt late Friday by Moody 's to one step above junk status . 0 953733 953537 Altria shares fell 2.5 percent or $ 1.11 to $ 42.57 and were the Dow 's biggest percentage loser . Its shares fell $ 9.61 to $ 50.26 , ranking as the NYSE 's most-active issue and its biggest percentage loser . 1 349215 349241 It will be followed in November by a third movie , " The Matrix Revolutions . " The film is the second of a trilogy , which will wrap up in November with " The Matrix Revolutions . " 1 2919853 2919804 Massachusetts regulators and the Securities and Exchange Commission on Tuesday pressed securities fraud charges against Putnam Investments and two of its former portfolio managers for alleged improper mutual fund trading . State and federal securities regulators filed civil charges against Putnam Investments and two portfolio managers in the ever-expanding mutual fund trading scandal . 1 954526 954607 He is blocking them until the Air Force assigns four additional C-130 cargo planes to Gowen Field , an Idaho Air National Guard base in Boise . 
He is holding them up until the Air Force agrees to assign four additional C-130 cargo planes to the Idaho Air National Guard . 1 69773 69792 Cisco pared spending to compensate for sluggish sales . In response to sluggish sales , Cisco pared spending . 0 2823575 2823513 The study , published Monday in the journal Molecular Brain Research , is likely to also apply to humans , its authors said . The study , conducted on the brains of developing mice , was being published today in the journal Molecular Brain Research . 1 2455942 2455978 My decision today is not based on any one event . " Governor Rowland said his decision was " not based on any one event . " 1 131979 131957 Nelson , 27 , is being retried on civil-rights charges stemming from the disturbance which led to Rosenbaum 's death . Nelson , 27 , is being retried on civil rights charges stemming from the disturbance that led to Rosenbaum 's death . 0 2010705 2010779 " The government elements who have been causing trouble are still in place . The government elements who have been causing trouble are still in place , they are attacking us . " 1 54142 53641 Next Monday at about 2 p.m. ( CST ) , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms . Around the same time , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms . 1 1015249 1015204 Wal-Mart Stores Inc . , Kohl 's Corp. , Family Dollar Stores Inc. and Big Lots Inc. were among the merchants posting May sales that fell below Wall Street 's modest expectations . Wal- Mart , Kohl 's Corp. , Family Dollar Stores Inc . , and Big Lots Inc. posted May sales that fell below Wall Street 's modest expectations . 0 753928 753890 The patch also fixes a vulnerability that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box . 1 3022833 3023029 Peterson , a former fertilizer salesman , is charged with murder in the deaths of his 27-year-old wife and the baby boy she was carrying . Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son . 0 751520 751373 SPOT products run a Microsoft operating system and the company 's DirectBand radio technology developed with SCA Data Systems . The DirectBand network was developed with the assistance of SCA Data Systems . 0 218848 218851 He replaces Ron Dittemore , who announced his resignation in April . Dittemore announced his plans to resign on April 23 . 1 3181118 3181443 Detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , of the arrest shortly after Perry was apprehended . Shortly after his arrest , detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , a medical assistant , about the development . 1 515581 515752 They were among about 40 people attending the traditional Jewish ceremony colored by some non-traditional touches . He said about 40 people attended the traditional Jewish ceremony colored by some nontraditional touches . 1 347022 347003 Taiwan had been relatively free of the viral infection until a fiasco at a Taipei hospital in late April caused the number of infections to skyrocket . Taiwan had been relatively free of the viral infection until a severe outbreak at a Taipei hospital in late April . 1 3311600 3311633 Mr. 
Rowland attended a party in South Windsor for the families of Connecticut National Guard soldiers called to active duty . Rowland was making an appearance at a holiday party for families of Connecticut National Guard soldiers assigned to duty in Iraq and Afghanistan . 0 3439114 3439084 Ross Garber , Rowland 's lawyer , said Tuesday he would attend the meeting and would ask to speak on the issue . Ross Garber , Rowland 's legal counsel , said the governor would have no comment on the condo deal . 0 487951 488007 The euro was at 1.5281 versus the Swiss franc EURCHF = , up 0.2 percent on the session , after hitting its highest since mid-2001 around 1.5292 earlier in the session . The euro was steady versus the Swiss franc after hitting its highest since mid-2001 of 1.5261 earlier in the session . 0 314997 315030 On the stand Wednesday , she said she was referring only to the kissing . On the stand Wednesday , she testified that she was referring to the kissing before the alleged rape . 0 4733 4557 Garner said the group would probably be expanded to include , for example , a Christian and perhaps another Sunni leader . The group has already met several times and Gen. Garner said it probably will be expanded to include a Christian and perhaps another Sunni Muslim leader . 1 2820371 2820525 Blair 's Foreign Secretary Jack Straw was to take his place on Monday to give a statement to parliament on the European Union . Blair 's office said his Foreign Secretary Jack Straw would take his place on Monday to give a statement to parliament on the EU meeting the prime minister attended last week . 1 801552 801516 " There were more people surrounding the clubhouse than the Unabomber 's house up in the hills , " Baker said . " There are more people surrounding the clubhouse than surrounded the Unabomber 's home in the hills . 1 1704987 1705268 Charles O. Prince , 53 , was named as Mr. Weill 's successor . Mr. Weill 's longtime confidant , Charles O. Prince , 53 , was named as his successor . 1 396041 396188 Officials are also meeting with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . Canadian officials were also expected to meet yesterday with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . 0 1014983 1014963 GE stock closed Friday at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange . 1 2320654 2320666 The Midwestern research center will focus on the development of diagnostic , therapeutic and vaccine products for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . The Midwestern center will focus on diagnosis , treatment and vaccines for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . 1 1057876 1057778 The hearing is to determine whether there is enough evidence to order Akbar to a general court-martial proceeding . The purpose of the hearing is to determine whether Akbar should be court-martialled . 0 2116843 2116883 In the United States , heart attacks kill about 460,000 year , in Canada about 80,000 . In the United States , heart attacks kill about 460,000 yearly , according to the National Institutes of Health . 1 1461629 1461781 Ninety-five percent of international cargo to the United States is carried by ship . Ships carry 95 percent of international cargo to the United States . 
0 374015 374162 " It 's a major victory for Maine , and it 's a major victory for other states . The Maine program could be a model for other states . 1 2493369 2493428 News that oil producers were lowering their output starting in November exacerbated a sell-off that was already under way on Wall Street . News that the Organization of Petroleum Exporting Countries was lowering output starting in November exacerbated a stock sell-off already under way yesterday . 1 490355 490378 They note that after several weeks of rallies on upbeat earnings , investors are looking for stronger evidence of a recovery before sending stocks higher . After several weeks of market rallies on upbeat earnings , many investors are looking for more concrete signs of an economic recovery . 1 2691044 2691264 Most economists had expected a more dire report , with many anticipating the fifth month of job losses in six months . Most economists had been expecting a far more dire report , with many expecting to see the fifth month of job losses in six months in September . 1 1831453 1831491 But software license revenues , a measure financial analysts watch closely , decreased 21 percent to $ 107.6 million . License sales , a key measure of demand , fell 21 percent to $ 107.6 million . 1 2380695 2380822 King , brand-name writer , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters . Stephen King , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters from the National Book Foundation . 1 2577517 2577531 The Denver-based natural gas producer and marketer said the inaccurate reporting was discovered after it received a subpoena from the U.S. Commodity Futures Trading Commission . The natural gas producer and marketer said the inaccurate reporting was discovered in response to a subpoena from the U.S. Commodity Futures Trading Commission , or CFTC . 1 3267026 3266930 The steel tariffs , which the U.S. president imposed in March 2002 , will officially end at midnight , instead of March 2005 as initially planned . The U.S. steel tariffs , which Bush imposed in March 2002 , were to officially end at midnight Thursday ( 0500 GMT ) , instead of March 2005 as initially planned . 1 360875 360943 Business Week 's online edition reported on Friday that WorldCom and the SEC could announce a settlement as early as Monday . BusinessWeek Online has learned that the settlement could come as early as Monday , May 19 . 1 162632 162653 Only one of the five buildings in the Baghdad compound of the United Nations Development Program escaped being burned , the UN said on its Web site . Only one of the five buildings in the compound in Baghdad run by the UN Development Program , escaped being burned , the UN said on its Web site . 1 1128884 1128865 Shares of Salix have rocketed 64 percent since Axcan made its first offer on April 10 . Since the initial takeover offer , Salix shares have risen about 35 percent . 1 3264732 3264648 The jury verdict , reached Wednesday after less than four hours of deliberation , followed a 2 week trial , during which Waagner represented himself . The quick conviction followed a 2 1 / 2 week trial , during which the Venango County man represented himself . 1 1721433 1721267 It 's happened five times in the last 11 years : A disaster puts this Southwestern town in the headlines during the summer tourist season . 
It 's happened five times in the last decade : A disaster puts this tourist town in the headlines during summer , its busiest season . 0 146112 146127 The broader Standard & Poor 's 500 Index .SPX edged down 9 points , or 0.98 percent , to 921 . The technology-laced Nasdaq Composite Index < .IXIC > shed 15 points , or 0.98 percent , to 1,492 . 1 389117 389052 The company emphasized that McDonald 's USA does not import any raw beef or hamburger patties from Canada for McDonald 's use in the United States . McDonald 's said in a statement that it does not import any raw beef or hamburger patties from Canada for use in the United States . 1 872784 872834 Gregory Parseghian , a former investment banker , was appointed chief executive . Greg Parseghian was appointed the new chief executive . 0 2977500 2977547 Their contract will expire at 12 : 01 a.m. Wednesday instead of 12 : 01 a.m. Sunday , said Rian Wathen , organizing director for United Food and Commercial Workers Local 700 . " It has outraged the membership , " said Rian Wathen , organizing director of United Food and Commercial Workers Local 700 . 1 3107137 3107119 But plaque volume increased by 2.7 percent in pravastatin patients . The volume of plaque in Pravachol patients ' arteries rose by 3 % . 1 1619244 1619274 Today in the US , the book - kept under wraps by its publishers , G. P. Putnam 's Sons , since its inception - will appear in bookstores . Tomorrow the book , kept under wraps by G. P. Putnam 's Sons since its inception , will appear in bookstores . 0 3061836 3062031 The S & P / TSX composite rose 87.74 points on the week , while the TSX Venture Exchange composite gained 44.49 points . On the week , the Dow Jones industrial average rose 11.56 points , while the Nasdaq Stock Market gained 39.42 points . 1 485999 486011 Ex-KGB agent Putin added that the Beatles were considered ' propaganda of an alien ideology ' . In Soviet times the Beatles ' music " was considered propaganda of an alien ideology .

================================================
FILE: archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/mrpc.proto
================================================
// Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT-0
//
// gRPC interface for the BERT MRPC paraphrase-detection demo: the client
// sends a pair of sentences and receives a paraphrase prediction.

syntax = "proto3";

package mrpc;

service mrpc {
    rpc paraphrase (TextPair) returns (YesNo) {}
}

message TextPair {
    bytes text_a = 1;
    bytes text_b = 2;
}

message YesNo {
    bytes message = 1;
    bytes prediction = 2;
}
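The contract above is simple: the client sends two byte strings, and the server answers with a paraphrase verdict. The following is a minimal client sketch, not part of the original tutorial: the stub module names follow the standard ``grpc_tools.protoc`` output for ``mrpc.proto``, and the ``localhost:8500`` endpoint is an illustrative placeholder.

.. code:: python

   # Illustrative client for the mrpc service defined above (not from the
   # original tutorial). Generate the stubs first, for example with:
   #   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. mrpc.proto
   import grpc

   import mrpc_pb2
   import mrpc_pb2_grpc


   def is_paraphrase(text_a, text_b, endpoint="localhost:8500"):
       """Send a sentence pair to the demo server and return its prediction."""
       with grpc.insecure_channel(endpoint) as channel:
           stub = mrpc_pb2_grpc.mrpcStub(channel)
           request = mrpc_pb2.TextPair(text_a=text_a.encode("utf-8"),
                                       text_b=text_b.encode("utf-8"))
           reply = stub.paraphrase(request)
           return reply.prediction.decode("utf-8")


   if __name__ == "__main__":
       # A pair drawn from the MRPC data above.
       print(is_paraphrase("GE stock closed at $ 30.65 a share .",
                           "GE 's shares closed at $ 30.65 on Friday ."))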
================================================
FILE: archive/tensorflow/tensorflow-neuron/tutorials/index.rst
================================================
.. _tensorflow-tutorials:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

TensorFlow Tutorials
====================

.. warning::
   This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

Before running a tutorial
-------------------------

You will run the tutorials on an inf1.6xlarge instance running Deep Learning AMI (DLAMI) to enable both compilation and deployment (inference) on the same instance. In a production environment we encourage you to try different instance sizes to optimize to your specific deployment needs.

Follow the instructions at :ref:`tensorflow-tutorial-setup` before running a TensorFlow tutorial on Inferentia. We recommend new users start with the ResNet-50 tutorial.

.. toctree::
   :hidden:

   /archive/tensorflow/tensorflow-neuron/tutorials/tensorflow-tutorial-setup

.. _tensorflow-nlp:

Natural Language Processing
---------------------------

* Tensorflow 2.x - HuggingFace DistilBERT with Tensorflow2 Neuron :ref:`[html] ` :github:`[notebook] `

.. toctree::
   :hidden:

   /archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo
   /src/examples/tensorflow/huggingface_bert/huggingface_bert

.. _tensorflow-utilize-neuron:

Utilizing Neuron Capabilities
-----------------------------

* Tensorflow 2.x - Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving :ref:`[html] `

.. toctree::
   :hidden:

   /src/examples/tensorflow/tensorflow_serving_tutorial.rst

================================================
FILE: archive/tensorflow/tensorflow-neuron/tutorials/k8s_bert_demo/Dockerfile.tfserving_example
================================================
FROM ubuntu:16.04

RUN apt-get update
RUN apt-get install -y wget apt-transport-https ca-certificates awscli
RUN echo "deb https://apt.repos.neuron.amazonaws.com xenial main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
RUN apt-get update
RUN apt-get install -y tensorflow-model-server-neuron
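The Dockerfile above only installs ``tensorflow-model-server-neuron`` from the Neuron apt repository; it bakes in no model and no entrypoint. As a rough usage sketch, assuming a SavedModel on the host (the image tag, model name, and paths are placeholders rather than values from the k8s demo, and the container must be granted a Neuron device):

.. code:: bash

   # Illustrative only -- tag, model name, and paths are placeholders.
   docker build -f Dockerfile.tfserving_example -t tfserving-neuron-example .
   docker run --device=/dev/neuron0 -p 8500:8500 \
       -v /path/to/saved_models:/models tfserving-neuron-example \
       tensorflow_model_server_neuron --port=8500 \
       --model_name=bert_mrpc --model_base_path=/models/bert_mrpc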
toctree:: :maxdepth: 1 :hidden: Natural Language Processing (NLP) Tutorials Utilizing Neuron Capabilities Tutorials .. include:: /archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-neuron.txt ================================================ FILE: archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-neuron.txt ================================================ .. tab-set:: .. tab-item:: Natural Language Processing (NLP) Tutorials * Tensorflow 2.x - HuggingFace Pipelines distilBERT with Tensorflow2 Neuron :ref:`[html] ` :github:`[notebook] ` .. tab-item:: Utilizing Neuron Capabilities Tutorials * Tensorflow 2.x - Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving :ref:`[html] ` .. note:: To use Jupyter Notebook see: * :ref:`setup-jupyter-notebook-steps-troubleshooting` * :ref:`running-jupyter-notebook-as-script` ================================================ FILE: archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-nlp.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Natural Language Processing (NLP) Tutorials (``tensorflow-neuron``) =================================================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * Tensorflow 2.x - HuggingFace DistilBERT with Tensorflow2 Neuron :ref:`[html] ` :github:`[notebook] ` .. toctree:: :hidden: /archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo /src/examples/tensorflow/huggingface_bert/huggingface_bert ================================================ FILE: archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-utilizing-neuron-capabilities.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Utilizing Neuron Capabilities Tutorials (``tensorflow-neuron``) =============================================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving :ref:`[html] ` .. note:: To use Jupyter Notebook see: * :ref:`setup-jupyter-notebook-steps-troubleshooting` * :ref:`running-jupyter-notebook-as-script` ================================================ FILE: archive/tensorflow/tensorflow-neuron-inference.rst ================================================ .. _inference-tensorflow-neuron: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Inference on Inf1 (``tensorflow-neuron``) ========================================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: Tutorials Additional Examples API Reference Guide Misc .. include:: tensorflow-neuron-inference.txt ================================================ FILE: archive/tensorflow/tensorflow-neuron-inference.txt ================================================ .. 
card:: Setup (``tensorflow-neuron``) :class-body: sphinx-design-class-title-small See :doc:`TensorFlow Neuron setup `. .. dropdown:: Tutorials (``tensorflow-neuron``) :class-title: sphinx-design-class-title-med :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-neuron.txt .. dropdown:: Additional Examples (``tensorflow-neuron``) :class-title: sphinx-design-class-title-med :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuron/additional-examples.txt .. dropdown:: API Reference Guide (``tensorflow-neuron``) :class-title: sphinx-design-class-title-med :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuron/api-reference-guide.txt .. dropdown:: Misc (``tensorflow-neuron``) :class-title: sphinx-design-class-title-med :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuron/misc-tensorflow-neuron.txt ================================================ FILE: archive/tensorflow/tensorflow-neuronx/api-reference-guide.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 API Reference Guide (``tensorflow-neuronx``) ============================================ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: /archive/tensorflow/tensorflow-neuronx/tfneuronx-python-tracing-api /archive/tensorflow/tensorflow-neuronx/tf-neuronx-auto-replication-api /archive/tensorflow/tensorflow-neuronx/tfnx-analyze-model-api .. include:: /archive/tensorflow/tensorflow-neuronx/api-reference-guide.txt ================================================ FILE: archive/tensorflow/tensorflow-neuronx/api-reference-guide.txt ================================================ * :ref:`tfneuronx-ref-neuron-tracing-api` * :ref:`tf-neuronx-ref-auto-replication-python-api` * :ref:`tf-neuronx-ref-analyze-model-api` ================================================ FILE: archive/tensorflow/tensorflow-neuronx/misc-tensorflow-neuronx.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Misc (``tensorflow-neuronx``) ============================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: /release-notes/archive/tensorflow/tensorflow-neuronx/tensorflow-neuronx .. include:: /archive/tensorflow/tensorflow-neuronx/misc-tensorflow-neuronx.txt ================================================ FILE: archive/tensorflow/tensorflow-neuronx/misc-tensorflow-neuronx.txt ================================================ * :ref:`tensorflow-neuronx-release-notes` ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/index.rst ================================================ .. _tensorflow-neuron-setup: .. _tensorflow-neuronx-main: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained.
:date-modified: 2026-03-11 TensorFlow Setup Guide for Inf2 & Trn1 ====================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 Fresh install ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.8.0-tensorflow-install.rst ================================================ .. _install-neuronx-2.8.0-tensorflow: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Tensorflow Neuron (Neuron 2.8.0) ======================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. tab-set:: .. tab-item:: Tensorflow 2.10.0 .. tab-set:: .. tab-item:: Amazon Linux 2 AMI .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.0 --neuron-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami .. tab-item:: Ubuntu 20 AMI .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.0 --neuron-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.9.0-tensorflow-install.rst ================================================ .. _install-neuronx-2.9.0-tensorflow: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Tensorflow Neuron (Neuron 2.9.0) ======================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. tab-set:: .. tab-item:: Tensorflow 2.10.0 .. tab-set:: .. tab-item:: Amazon Linux 2 AMI .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.0 --neuron-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami .. tab-item:: Ubuntu 20 AMI .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.0 --neuron-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-al2.rst ================================================ .. _tensorflow-neuronx-install-prev-al2: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous TensorFlow Neuron Releases for Amazon Linux (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. 
TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.17.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.17.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.16.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.16.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-al2023.rst ================================================ .. _tensorflow-neuronx-install-prev-al2023: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous TensorFlow NeuronX Releases for Amazon Linux 2023 (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.21.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-u20.rst ================================================ .. _tensorflow-neuronx-install-prev-u20: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous TensorFlow Neuron Releases for Ubuntu (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. 
TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-u22.rst ================================================ .. _tensorflow-neuronx-install-prev-u22: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous TensorFlow Neuron Releases for Ubuntu (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.21.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-neuronx-install.rst ================================================ .. _install-tensorflow-neuronx: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install TensorFlow 2.x (``tensorflow-neuronx``) =============================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. tab-set:: ..
tab-item:: Tensorflow 2.10.1 .. tab-set:: .. tab-item:: Amazon Linux 2 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 13 :end-line: 16 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 32 :end-line: 33 .. tab-item:: Ubuntu 20 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 19 :end-line: 22 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 35 :end-line: 36 .. tab-item:: Tensorflow 2.9.3 .. tab-set:: .. tab-item:: Amazon Linux 2 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 13 :end-line: 16 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 74 :end-line: 75 .. tab-item:: Ubuntu 20 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 19 :end-line: 22 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 77 :end-line: 78 .. tab-item:: Tensorflow 2.8.4 .. tab-set:: .. tab-item:: Amazon Linux 2 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 13 :end-line: 16 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 80 :end-line: 81 .. tab-item:: Ubuntu 20 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 19 :end-line: 22 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 83 :end-line: 84 .. tab-item:: Tensorflow 2.7.4 .. tab-set:: .. tab-item:: Amazon Linux 2 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 13 :end-line: 16 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 86 :end-line: 87 .. tab-item:: Ubuntu 20 .. include :: /setup/install-templates/trn1/dlami-notes.rst :start-line: 19 :end-line: 22 .. include :: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 89 :end-line: 90 ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-al2-dlami.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. _tensorflow-neuronx-al2-dlami-update: Update to latest TensorFlow Neuron (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links to assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: Tensorflow 2.10.1 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 122 :end-line: 123 .. tab-item:: Tensorflow 2.9.3 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 125 :end-line: 126 .. tab-item:: Tensorflow 2.8.4 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst ..
include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 128 :end-line: 129 ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-al2.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. _tensorflow-neuronx-al2-update: Update to latest TensorFlow Neuron (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links to assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: Tensorflow 2.10.1 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 56 :end-line: 57 .. tab-item:: Tensorflow 2.9.3 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 62 :end-line: 63 .. tab-item:: Tensorflow 2.8.4 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 68 :end-line: 69 ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u20-dlami.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. _tensorflow-neuronx-u20-dlami-update: Update to latest TensorFlow Neuron (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links to assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: Tensorflow 2.10.1 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 131 :end-line: 132 .. tab-item:: Tensorflow 2.9.3 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 134 :end-line: 135 .. tab-item:: Tensorflow 2.8.4 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 137 :end-line: 138 ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u20.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. _tensorflow-neuronx-u20-update: Update to latest TensorFlow NeuronX (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived.
TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links to assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: Tensorflow 2.10.1 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 59 :end-line: 60 .. tab-item:: Tensorflow 2.9.3 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 65 :end-line: 66 .. tab-item:: Tensorflow 2.8.4 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. include:: /src/helperscripts/installationScripts/python_instructions.txt :start-line: 71 :end-line: 72 ================================================ FILE: archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u22.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 .. _tensorflow-neuronx-u22-update: Update to latest TensorFlow Neuron (``tensorflow-neuronx``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links to assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: Tensorflow 2.10.1 .. include:: /frameworks/torch/torch-neuronx/setup/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami ================================================ FILE: archive/tensorflow/tensorflow-neuronx/tf-neuronx-auto-replication-api.rst ================================================ .. _tf-neuronx-ref-auto-replication-python-api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 TensorFlow 2.x (``tensorflow-neuronx``) Auto Multicore Replication (Beta) =========================================================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. The Neuron auto multicore replication Python API enables modifying TensorFlow 2.x models traced by ``tensorflow_neuronx.trace`` so that they can be automatically replicated across multiple cores. ..
contents:: Table of contents :local: :depth: 1 TensorFlow 2.x (``tensorflow-neuron TF2.x``) Auto Multicore Replication Python API (Beta) ------------------------------------------------------------------------------------------- Method ^^^^^^ ``tensorflow.neuron.auto_multicore`` on models traced by ``tensorflow_neuronx.trace`` Description ^^^^^^^^^^^ Converts an existing AWS-Neuron-optimized ``keras.Model`` and returns an auto-replication tagged AWS-Multicore-Neuron-optimized ``keras.Model`` that can execute on AWS Machine Learning Accelerators. Like the traced model, the returned ``keras.Model`` will support inference only. Attributes or variables held by the original function or ``keras.Model`` will be dropped. The auto model replication feature in TensorFlow-Neuron enables you to create a model once and have it replicated across multiple cores automatically. The desired number of cores can be less than the total available NeuronCores on a trn1 or inf2 instance, but not less than 1. This reduces framework memory usage, as you are not loading the same model multiple times manually. Calls to the returned model will execute on each core in a round-robin fashion. The returned ``keras.Model`` can be exported as SavedModel and served using TensorFlow Serving. Please see the TensorFlow Serving documentation for more information about exporting to SavedModel and serving using TensorFlow Serving. Note that the automatic replication will only work on models compiled with a pipeline size of 1 (via ``--neuroncore-pipeline-cores=1``). If auto replication is not enabled, the model will default to replicate on up to 4 cores. See :ref:`neuron-compiler-cli-reference-guide` for more information about compiler options. Arguments ^^^^^^^^^ - **func:** The ``keras.Model`` or function to be traced. - **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict``. The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported. - **num_cores:** The desired number of cores across which the model will be automatically replicated Returns ^^^^^^^ - An AWS-Multicore-Neuron-optimized ``keras.Model``. Example Python API Usage for TF2.x traced models: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code :: python import tensorflow as tf import tensorflow.neuron as tfn import tensorflow_neuronx as tfnx input0 = tf.keras.layers.Input(3) dense0 = tf.keras.layers.Dense(3)(input0) inputs = [input0] outputs = [dense0] model = tf.keras.Model(inputs=inputs, outputs=outputs) input0_tensor = tf.random.uniform([1, 3]) model_neuron = tfnx.trace(model, input0_tensor) # a trn1.2xlarge has 2 neuron cores num_cores = 2 multicore_model = tfn.auto_multicore(model_neuron, input0_tensor, num_cores=num_cores) multicore_model(input0_tensor) Example Python API Usage for TF2.x saved models: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code :: python from tensorflow.python import saved_model input0_tensor = tf.random.uniform([1, 3]) num_cores = 4 reload_model = saved_model.load(model_dir) multicore_model = tfn.auto_multicore(reload_model, input0_tensor, num_cores=num_cores)
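Like a traced model, the replicated model can be exported as a SavedModel for serving. The following is a minimal sketch, assuming the ``multicore_model`` from the example above; the output path is illustrative, and the trailing ``1`` is the numeric version subdirectory that TensorFlow Serving expects:

.. code :: python

    import tensorflow as tf

    # Illustrative path; '1' is the model version directory that
    # TensorFlow Serving scans for under the model base path.
    tf.saved_model.save(multicore_model, './multicore_model/1')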
.. _tensorflow-ref-auto-replication-cli-api-neuronx: TensorFlow Neuron TF2.x (``tensorflow-neuronx TF2.x``) Auto Multicore Replication CLI (Beta) --------------------------------------------------------------------------------------------------------------- The Neuron auto multicore replication CLI enables modifying Tensorflow 2.x traced saved models so that they can be automatically replicated across multiple cores. By performing this call on Tensorflow Saved Models, we can support Tensorflow-Serving without significant modifications to the code. Method ^^^^^^ ``tf-neuron-auto-multicore MODEL_DIR --num_cores NUM_CORES --new_model_dir NEW_MODEL_DIR`` Arguments ^^^^^^^^^ - **MODEL_DIR:** The directory of a saved AWS-Neuron-optimized ``keras.Model``. - **NUM_CORES:** The desired number of cores across which the model will be automatically replicated - **NEW_MODEL_DIR:** The directory where the AWS-Multicore-Neuron-optimized ``keras.Model`` will be saved Example CLI Usage for Tensorflow-Serving saved models: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code :: bash tf-neuron-auto-multicore ./resnet --num_cores 8 --new_model_dir ./modified_resnet ================================================ FILE: archive/tensorflow/tensorflow-neuronx/tfneuronx-python-tracing-api.rst ================================================ .. _tfneuronx-ref-neuron-tracing-api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 TensorFlow 2.x (``tensorflow-neuronx``) Tracing API ==================================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. The Neuron tracing API enables tracing TensorFlow 2.x models for deployment on trn1 and inf2 AWS machine learning accelerators. Method ------ ``tensorflow_neuronx.trace`` Description ----------- Trace a ``keras.Model`` or a Python callable that can be decorated by ``tf.function``, and return an AWS-Neuron-optimized ``keras.Model`` that can execute on trn1 and inf2 AWS machine learning accelerators. Tracing is ideal for a ``keras.Model`` that accepts a list of ``tf.Tensor`` objects and returns a list of ``tf.Tensor`` objects. It is expected that users will provide example inputs, and the ``trace`` function will execute ``func`` symbolically and convert it to a ``keras.Model``. The returned ``keras.Model`` will support inference only. Attributes or variables held by the original function or ``keras.Model`` will be dropped. The returned ``keras.Model`` can be exported as SavedModel and served using TensorFlow Serving. Please see the TensorFlow Serving documentation for more information about exporting to SavedModel and serving using TensorFlow Serving. The returned ``keras.Model`` has an ``.on_neuron_ratio`` attribute which shows the percentage of ops mapped to Neuron hardware. This calculation ignores PlaceholderOp, IdentityOp, ReadVariableOp and NoOp. Options can be passed to the Neuron compiler via the environment variable ``NEURON_CC_FLAGS``. For example, the syntax ``env NEURON_CC_FLAGS="--workdir ./artifacts"`` directs the Neuron compiler to dump artifacts in the artifacts directory for debugging. See :ref:`neuron-compiler-cli-reference-guide` for more information about compiler options. Arguments --------- - **func:** The ``keras.Model`` or function to be traced.
- **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict``. The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported. - **subgraph_builder_function:** (Optional) A callable with signature ``subgraph_builder_function(node : NodeDef) -> bool`` (``NodeDef`` is defined in tensorflow/core/framework/node_def.proto) that is used as a callback function to determine which part of the TensorFlow GraphDef obtained by tracing ``func`` will be placed on Machine Learning Accelerators. If ``subgraph_builder_function`` is not provided, then ``trace`` will automatically place operations on Machine Learning Accelerators or on CPU to maximize the execution efficiency. If it is provided, and ``subgraph_builder_function(node)`` returns ``True``, and placing ``node`` on Machine Learning Accelerators will not cause deadlocks during execution, then ``trace`` will place ``node`` on Machine Learning Accelerators. If ``subgraph_builder_function(node)`` returns ``False``, then ``trace`` will place ``node`` on CPU. .. _tensorflow-neuronx-special-flags: Special Flags ------------- These are flags that get passed directly to the Neuron tracing API (rather than the Neuron Compiler). The flags are still passed via the environment variable ``NEURON_CC_FLAGS``; a short usage sketch follows the example below. - **workdir:** example usage - ``NEURON_CC_FLAGS='--workdir ./artifacts'`` will create a folder named artifacts in the current directory and save artifacts that can be used for debugging. - **dynamic-batch-size:** example usage - ``NEURON_CC_FLAGS='--dynamic-batch-size'`` A flag to allow Neuron graphs to consume variable sized batches of data. Dynamic sizing is restricted to the 0th dimension of a tensor. - **extract-weights (Beta):** example usage - ``NEURON_CC_FLAGS='--extract-weights trn1.2xlarge'`` will reduce the compiled model's protobuf size by taking the weights out of the protobuf. Useful for compiling large models that would exceed the 2GB protobuf size limit. This feature is in beta. Model performance is not guaranteed, and the flag does not work in combination with ``--neuroncore-pipeline-cores``, ``--dynamic-batch-size``, models with multiple NEFFs, and models that are 16GB or greater. Compiles the model for different Neuron instance types depending on the instance type passed. Supports all trn1 and inf2 instance types except for trn1n. Returns ------- - An AWS-Neuron-optimized ``keras.Model``. Example Usage ------------- .. code:: python import tensorflow as tf import tensorflow_neuronx as tfnx input0 = tf.keras.layers.Input(3) dense0 = tf.keras.layers.Dense(3)(input0) model = tf.keras.Model(inputs=[input0], outputs=[dense0]) example_inputs = tf.random.uniform([1, 3]) model_neuron = tfnx.trace(model, example_inputs) # trace # check to see how much of the model was compiled successfully print(model_neuron.on_neuron_ratio) model_dir = './model_neuron' model_neuron.save(model_dir) model_neuron_reloaded = tf.keras.models.load_model(model_dir)
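The Special Flags described above are set through the same ``NEURON_CC_FLAGS`` environment variable as regular compiler options. A minimal sketch, assuming the ``model`` and ``example_inputs`` from the example above, of enabling dynamic batching and a debug work directory before tracing:

.. code:: python

    import os

    # Both flags are consumed by the tracing API rather than the compiler:
    # '--dynamic-batch-size' permits variable batch sizes along dimension 0,
    # and '--workdir' saves debug artifacts under ./artifacts.
    os.environ['NEURON_CC_FLAGS'] = '--dynamic-batch-size --workdir ./artifacts'

    model_neuron = tfnx.trace(model, example_inputs)

Example Usage with Manual Device Placement Using ``subgraph_builder_function``
------------------------------------------------------------------------------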
.. code:: python import tensorflow as tf import tensorflow_neuronx as tfnx input0 = tf.keras.layers.Input(3) dense0 = tf.keras.layers.Dense(3)(input0) reshape0 = tf.keras.layers.Reshape([1, 3])(dense0) output0 = tf.keras.layers.Dense(2)(reshape0) model = tf.keras.Model(inputs=[input0], outputs=[output0]) example_inputs = tf.random.uniform([1, 3]) def subgraph_builder_function(node): return node.op == 'MatMul' model_neuron = tfnx.trace( model, example_inputs, subgraph_builder_function=subgraph_builder_function, ) ================================================ FILE: archive/tensorflow/tensorflow-neuronx/tfnx-analyze-model-api.rst ================================================ .. _tf-neuronx-ref-analyze-model-api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 TensorFlow 2.x (``tensorflow-neuronx``) analyze_model API ========================================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. Method ------ ``tensorflow_neuronx.analyze_model`` Description ----------- Analyzes a ``keras.Model`` or a Python callable that can be decorated by ``tf.function`` for its compatibility with Neuron. It displays supported vs. unsupported operators in the model as well as percentages and counts of each operator, and returns a dictionary with operator statistics. Arguments --------- - **func:** The ``keras.Model`` or function to be analyzed. - **example_inputs:** A ``tf.Tensor`` or a tuple/list/dict of ``tf.Tensor`` objects for tracing the function. When ``example_inputs`` is a ``tf.Tensor`` or a list of ``tf.Tensor`` objects, we expect ``func`` to have calling signature ``func(example_inputs)``. Otherwise, the expectation is that inference on ``func`` is done by calling ``func(*example_inputs)`` when ``example_inputs`` is a ``tuple``, or ``func(**example_inputs)`` when ``example_inputs`` is a ``dict``. The case where ``func`` accepts mixed positional and keyword arguments is currently unsupported. Returns ------- - A results ``dict`` with these keys: ``'percent_supported'``, ``'supported_count'``, ``'total_count'``, ``'supported_operators'``, ``'unsupported_operators'``, ``'operators'``, ``'operator_count'``. Example Usage ------------- .. code:: python import tensorflow as tf import tensorflow_neuronx as tfnx input0 = tf.keras.layers.Input(3) dense0 = tf.keras.layers.Dense(3)(input0) model = tf.keras.Model(inputs=[input0], outputs=[dense0]) example_inputs = tf.random.uniform([1, 3]) results = tfnx.analyze_model(model, example_inputs) print(results) # expected output ''' BiasAdd MatMul 100.00% of all operations (2 of 2) are supported {'percent_supported': 100.0, 'supported_count': 2, 'total_count': 2, 'supported_operators': {'BiasAdd', 'MatMul'}, 'unsupported_operators': [], 'operators': ['BiasAdd', 'MatMul'], 'operator_count': {'MatMul': 1, 'BiasAdd': 1}} '''
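A typical use of the returned dictionary is as a quick pre-compilation check. A short sketch, assuming the ``results`` from the example above:

.. code:: python

    # Report any operators that would fall back to CPU before
    # committing to a full tfnx.trace compilation.
    if results['percent_supported'] < 100.0:
        print('Unsupported operators:', results['unsupported_operators'])

================================================ FILE: archive/tensorflow/tensorflow-neuronx/tutorials/tutorial-tensorflowx-serving-NeuronRT-Visible-Cores.rst ================================================ .. _tensorflow-servingx-neuronrt-visible-cores: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving ===================================================== ..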
warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. TensorFlow Serving allows customers to scale up inference workloads across a network. TensorFlow Neuron Serving uses the same API as normal TensorFlow Serving with two differences: (a) the saved model must be compiled for Neuron and (b) the entry point is a different binary named ``tensorflow_model_server_neuronx``. Follow the steps below to install the package using apt-get or dnf. This will be pre-installed in a future release. Install TensorFlow Model Server and Serving API ----------------------------------------------- Follow the steps in the TensorFlow NeuronX installation guide. Then ensure you install using either apt-get or dnf. .. code:: bash sudo apt-get install tensorflow-model-server-neuronx or .. code:: bash sudo dnf install tensorflow-model-server-neuronx You will also need the TensorFlow Serving API (use --no-deps to prevent installation of regular tensorflow). .. code:: bash pip install --no-deps tensorflow_serving_api For the example image preprocessing using Keras preprocessing, the Python Imaging Library Pillow is required: .. code:: bash pip install pillow To work around h5py issue https://github.com/aws/aws-neuron-sdk/issues/220: .. code:: bash pip install "h5py<3.0.0" Export and Compile Saved Model ------------------------------ The following example shows graph construction followed by the addition of a Neuron compilation step before exporting to a SavedModel. .. code:: python import tensorflow as tf import tensorflow_neuronx as tfnx import numpy as np tf.keras.backend.set_learning_phase(0) tf.keras.backend.set_image_data_format('channels_last') image_sizes = [224, 224] model = tf.keras.applications.ResNet50(weights='imagenet') example_inputs = tf.random.uniform([1, *image_sizes, 3], dtype=tf.float32) model_neuron = tfnx.trace(model, example_inputs) # run the model once to define the forward pass and allow for saving model_neuron(example_inputs) tf.keras.models.save_model(model_neuron, './resnet50_neuron/1') Serving Saved Model ------------------- You can now serve the saved model with the ``tensorflow_model_server_neuronx`` binary. To utilize multiple NeuronCores, it is recommended to launch multiple model servers, each listening on its own gRPC port: .. code:: bash export NEURON_RT_VISIBLE_CORES=0 # important to set this environment variable before launching model servers tensorflow_model_server_neuronx --model_name=resnet50_neuron \ --model_base_path=$(pwd)/resnet50_neuron/ --port=8500 # then to run another server on a different neuron core, open another # window and run this, except this time set NEURON_RT_VISIBLE_CORES=1 # and use a different port; you can keep doing this up to the number # of NeuronCores on your machine export NEURON_RT_VISIBLE_CORES=1 tensorflow_model_server_neuronx --model_name=resnet50_neuron \ --model_base_path=$(pwd)/resnet50_neuron/ --port=8501 The compiled model is staged in Neuron DRAM by the server to prepare for inference.
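Optionally, you can confirm that a server has finished loading the model before sending traffic. A small sketch using the model status API from the ``tensorflow_serving_api`` package installed earlier; the model name matches the launch flags above:

.. code:: python

    import grpc
    from tensorflow_serving.apis import get_model_status_pb2
    from tensorflow_serving.apis import model_service_pb2_grpc

    channel = grpc.insecure_channel('localhost:8500')
    stub = model_service_pb2_grpc.ModelServiceStub(channel)

    request = get_model_status_pb2.GetModelStatusRequest()
    request.model_spec.name = 'resnet50_neuron'

    # Prints the state of each loaded version, e.g. AVAILABLE.
    print(stub.GetModelStatus(request))

Generate inference requests to the model server
-----------------------------------------------

Now run inferences via GRPC as shown in the following sample client code:

..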
code:: python import numpy as np import grpc import tensorflow as tf from tensorflow.keras.preprocessing import image from tensorflow.keras.applications.resnet50 import preprocess_input from tensorflow_serving.apis import predict_pb2 from tensorflow_serving.apis import prediction_service_pb2_grpc from tensorflow.keras.applications.resnet50 import decode_predictions tf.keras.backend.set_image_data_format('channels_last') if __name__ == '__main__': channel = grpc.insecure_channel('localhost:8500') stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) img_file = tf.keras.utils.get_file( "./kitten_small.jpg", "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg") img = image.load_img(img_file, target_size=(224, 224)) img_array = preprocess_input(image.img_to_array(img)[None, ...]) request = predict_pb2.PredictRequest() request.model_spec.name = 'resnet50_neuron' request.inputs['input_1'].CopyFrom( tf.make_tensor_proto(img_array, shape=img_array.shape)) result = stub.Predict(request) prediction = tf.make_ndarray(result.outputs['output_1']) print(decode_predictions(prediction)) ================================================ FILE: archive/tensorflow/tensorflow-neuronx/tutorials/tutorials-tensorflow-neuronx.rst ================================================ .. _inference-tensorflow-neuronx-tutorials: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Tutorials (``tensorflow-neuronx``) =================================== .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: HuggingFace Roberta-Base /archive/tensorflow/tensorflow-neuronx/tutorials/tutorial-tensorflowx-serving-NeuronRT-Visible-Cores .. include:: /archive/tensorflow/tensorflow-neuronx/tutorials/tutorials-tensorflow-neuronx.txt ================================================ FILE: archive/tensorflow/tensorflow-neuronx/tutorials/tutorials-tensorflow-neuronx.txt ================================================ * HuggingFace Roberta-Base :ref:`[html]` :github:`[notebook] ` * :ref:`tensorflow-servingx-neuronrt-visible-cores` .. note:: To use Jupyter Notebook see: * :ref:`setup-jupyter-notebook-steps-troubleshooting` * :ref:`running-jupyter-notebook-as-script` ================================================ FILE: archive/tensorflow/tensorflow-neuronx-inference.rst ================================================ .. _inference-tensorflow-neuronx: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Inference on Inf2 & Trn1/Trn1n (``tensorflow-neuronx``) ======================================================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: Tutorials API Reference Guide Misc .. include:: tensorflow-neuronx-inference.txt ================================================ FILE: archive/tensorflow/tensorflow-neuronx-inference.txt ================================================ .. card:: Setup (``tensorflow-neuronx``) :class-body: sphinx-design-class-title-small See :doc:`TensorFlow NeuronX setup `. .. 
dropdown:: Tutorials (``tensorflow-neuronx``) :class-title: sphinx-design-class-title-med :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuronx/tutorials/tutorials-tensorflow-neuronx.txt .. dropdown:: API Reference Guide (``tensorflow-neuronx``) :class-title: sphinx-design-class-title-med :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuronx/api-reference-guide.txt .. dropdown:: Misc (``tensorflow-neuronx``) :class-title: sphinx-design-class-title-med :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/tensorflow/tensorflow-neuronx/misc-tensorflow-neuronx.txt ================================================ FILE: archive/tensorflow/tensorflow-setup.rst ================================================ .. _tf-setup: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Tensorflow Neuron Setup ======================= .. warning:: This document is archived. TensorFlow is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: tensorflow-setup.txt ================================================ FILE: archive/tensorflow/tensorflow-setup.txt ================================================ .. card:: Tensorflow Neuron (``tensorflow-neuronx``) Setup for Inf2, Trn1/Trn1n Instances :class-body: sphinx-design-class-title-small See :doc:`TensorFlow NeuronX setup `. .. card:: Tensorflow Neuron (``tensorflow-neuron``) Setup for Inf1 Instances :class-body: sphinx-design-class-title-small See :doc:`TensorFlow Neuron setup `. ================================================ FILE: archive/torch-neuron/additional-examples-inference-torch-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Additional Examples (``torch-neuron``) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: AWS Neuron Samples GitHub Repository .. include:: /archive/torch-neuron/additional-examples-inference-torch-neuron.txt ================================================ FILE: archive/torch-neuron/additional-examples-inference-torch-neuron.txt ================================================ * `AWS Neuron Samples GitHub Repository `_ ================================================ FILE: archive/torch-neuron/api-compilation-python-api.rst ================================================ .. _torch_neuron_trace_api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch-Neuron trace Python API ================================ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. The PyTorch-Neuron trace Python API provides a method to generate PyTorch models for execution on Inferentia, which can be serialized as TorchScript. It is analogous to the :func:`torch.jit.trace` function in PyTorch. ..
py:function:: torch_neuron.trace(model, example_inputs, **kwargs) The :func:`torch_neuron.trace` method sends operations to the Neuron-Compiler (``neuron-cc``) for compilation and embeds compiled artifacts in a TorchScript graph. Compilation can be done on any EC2 machine with sufficient memory and compute resources. A c5.4xlarge or larger is recommended. Options can be passed to the Neuron compiler via this function. See :ref:`neuron-compiler-cli-reference` for more information about compiler options. This function partitions nodes into operations that are supported by Neuron and operations which are not. Operations which are not supported by Neuron are run on CPU. Graph partitioning can be controlled by the ``subgraph_builder_function``, ``minimum_segment_size``, and ``fallback`` parameters (see below). By default, all supported operations are compiled and run on Neuron. The compiled graph can be saved using the :func:`torch.jit.save` function and restored using the :func:`torch.jit.load` function for inference on Inf1 instances. During inference, the previously compiled artifacts will be loaded into the Neuron Runtime for inference execution. *Required Arguments* :arg ~torch.nn.Module,callable model: The function or module that will be run with the ``example_inputs`` arguments. The arguments and return types must be compatible with :func:`torch.jit.trace`. When a :class:`~torch.nn.Module` is passed to :func:`torch_neuron.trace`, only the :func:`~torch.nn.Module.forward` method is run and traced. :arg tuple example_inputs: A tuple of example inputs that will be passed to the ``model`` while tracing. The resulting trace can be run with inputs of different types and shapes assuming the traced operations support those types and shapes. This parameter may also be a single :class:`torch.Tensor` in which case it is automatically wrapped in a ``tuple``. *Optional Keyword Arguments* :keyword list[str] compiler_args: List of strings representing ``neuron-cc`` compiler arguments. Note that these arguments apply to all subgraphs generated by allowlist partitioning. For example, use :code:`compiler_args=['--neuroncore-pipeline-cores', '4']` to set the number of NeuronCores per subgraph to 4. See :ref:`neuron-compiler-cli-reference` for more information about compiler options. :keyword int compiler_timeout: Timeout in seconds for waiting ``neuron-cc`` to complete. Exceeding this timeout will cause a ``subprocess.TimeoutExpired`` exception. :keyword str compiler_workdir: Work directory used by ``neuron-cc``. Useful for debugging and/or inspecting ``neuron-cc`` logs/IRs. :keyword callable subgraph_builder_function: A function which is evaluated on each node during graph partitioning. This takes in a torch graph operator node and returns a :class:`bool` value of whether it should be included in the fused Neuron graph or not. By default the partitioner selects all operators which are supported by Neuron. :keyword int minimum_segment_size: A parameter used during partitioning. This specifies the minimum number of graph nodes which should be compiled into a Neuron graph (default= :code:`2`). If the number of nodes is smaller than this size, the operations will run on CPU. :keyword float single_fusion_ratio_threshold: A parameter used during partitioning. During partitioning, if a single partition contains a fraction of operations greater than this threshold, only one graph partition will be compiled (default= :code:`0.6`). This is used to avoid compiling many small Neuron graphs.
To force compilation of all graphs to Neuron (even when they are very small), a value of ``1.0`` can be used. :keyword bool fallback: A parameter used to turn off automatic graph partitioning. Indicates whether to attempt to fall back to CPU operations if an operation is not supported by Neuron. By default this is ``True``. If this is set to ``False`` and an operation is not supported by Neuron, this will fail compilation and raise an ``AttributeError``. :keyword bool dynamic_batch_size: A flag to allow Neuron graphs to consume variable sized batches of data. Dynamic sizing is restricted to the 0th dimension of a tensor. :keyword list optimizations: A list of :class:`~torch_neuron.Optimization` passes to apply to the model. :keyword bool separate_weights: A flag to enable compilation of models with over 1.9GB of constant parameters. By default this flag is ``False``. If this is set to ``True`` and the compiler version is not new enough to support the flag, this will raise a ``NotImplementedError``. :keyword \*\*kwargs: All other keyword arguments will be forwarded directly to :func:`torch.jit.trace`. This supports flags like ``strict=False`` in order to allow dictionary outputs. :returns: The traced :class:`~torch.jit.ScriptModule` with embedded compiled Neuron sub-graphs. Operations in this module will run on Neuron unless they are not supported by Neuron or manually partitioned to run on CPU. Note that in ``torch<1.8`` this would return a :class:`~torch.jit.ScriptFunction` if the input was a function type. :rtype: ~torch.jit.ScriptModule, ~torch.jit.ScriptFunction .. py:class:: torch_neuron.Optimization A set of optimization passes that can be applied to the model. .. py:attribute:: FLOAT32_TO_FLOAT16 A post-processing pass that converts all :attr:`torch.float32` tensors to :attr:`torch.float16` tensors. The advantage of this optimization pass is that input/output tensors will be type cast. This reduces the amount of data that will be copied to and from Inferentia hardware. The resulting traced model will accept both :attr:`torch.float32` and :attr:`torch.float16` inputs where the model used :attr:`torch.float32` inputs during tracing. It is only beneficial to enable this optimization if the throughput of a model is highly dependent upon data transfer speed. This optimization is not recommended if the final application will use :attr:`torch.float32` inputs since the :attr:`torch.float16` type cast will occur on CPU during inference. A usage sketch appears under Optimization Passes below. Example Usage ------------- Function Compilation ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import torch import torch_neuron def foo(x, y): return 2 * x + y # Run `foo` with the provided inputs and record the tensor operations traced_foo = torch.neuron.trace(foo, (torch.rand(3), torch.rand(3))) # `traced_foo` can now be run with the TorchScript interpreter or saved # and loaded in a Python-free environment torch.jit.save(traced_foo, 'foo.pt') traced_foo = torch.jit.load('foo.pt')
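Optimization Passes
~~~~~~~~~~~~~~~~~~~

A minimal sketch of applying the ``FLOAT32_TO_FLOAT16`` pass described above through the ``optimizations`` keyword; the pretrained ResNet-50 model and example input are illustrative:

.. code-block:: python

    import torch
    import torch_neuron
    from torchvision import models

    model = models.resnet50(pretrained=True)
    model.eval()

    image = torch.rand([1, 3, 224, 224])

    # Cast float32 input/output tensors to float16 at the model boundary
    # to reduce the amount of data copied to and from Inferentia.
    model_neuron = torch.neuron.trace(
        model,
        image,
        optimizations=[torch_neuron.Optimization.FLOAT32_TO_FLOAT16],
    )

Module Compilation
~~~~~~~~~~~~~~~~~~

..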
code-block:: python import torch import torch_neuron import torch.nn as nn class Net(nn.Module): def __init__(self): super(Net, self).__init__() self.conv = nn.Conv2d(1, 1, 3) def forward(self, x): return self.conv(x) + 1 n = Net() n.eval() inputs = torch.rand(1, 1, 3, 3) # Trace a specific method and construct `ScriptModule` with # a single `forward` method neuron_forward = torch.neuron.trace(n.forward, inputs) # Trace a module (implicitly traces `forward`) and construct a # `ScriptModule` with a single `forward` method neuron_net = torch.neuron.trace(n, inputs) Pre-Trained Model Compilation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following is an example usage of the compilation Python API, with default compilation arguments, using a pretrained :class:`torch.nn.Module`: .. code-block:: python import torch import torch_neuron from torchvision import models # Load the model and set it to evaluation mode model = models.resnet50(pretrained=True) model.eval() # Compile with an example input image = torch.rand([1, 3, 224, 224]) model_neuron = torch.neuron.trace(model, image) .. _compiling-models-with-kwargs: Compiling models with torch.jit.trace kwargs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This example uses the :code:`strict=False` flag to compile a model with dictionary outputs. Similarly, any other keyword argument of :func:`torch.jit.trace` can be passed directly to :func:`torch_neuron.trace` so that it is passed to the underlying trace call. .. code-block:: python import torch import torch_neuron import torch.nn as nn class Model(nn.Module): def __init__(self): super(Model, self).__init__() self.conv = nn.Conv2d(1, 1, 3) def forward(self, x): return {'conv': self.conv(x) + 1} model = Model() model.eval() inputs = torch.rand(1, 1, 3, 3) # Use the strict=False kwarg to compile a model with dictionary outputs # the model output format does not change model_neuron = torch.neuron.trace(model, inputs, strict=False) Dynamic Batching ~~~~~~~~~~~~~~~~ This example uses the :code:`dynamic_batch_size` option to support variable-sized batches at inference time. .. code-block:: python import torch import torch_neuron from torchvision import models # Load the model and set it to evaluation mode model = models.resnet50(pretrained=True) model.eval() # Compile with an example input of batch size 1 image = torch.rand([1, 3, 224, 224]) model_neuron = torch.neuron.trace(model, image, dynamic_batch_size=True) # Execute with a batch of 7 images batch = torch.rand([7, 3, 224, 224]) results = model_neuron(batch) Manual Partitioning ~~~~~~~~~~~~~~~~~~~ The following example uses the optional :code:`subgraph_builder_function` parameter to ensure that only a specific convolution layer is compiled to Neuron. The remaining operations are executed on CPU. .. code-block:: python import torch import torch_neuron import torch.nn as nn class ExampleConvolutionLayer(nn.Module): def __init__(self): super().__init__() self.conv = nn.Conv2d(1, 1, 3) def forward(self, x): return self.conv(x) + 1 class Model(nn.Module): def __init__(self): super().__init__() self.layer = ExampleConvolutionLayer() def forward(self, x): return self.layer(x) * 100 def subgraph_builder_function(node) -> bool: """Select if the node will be included in the Neuron graph""" # Node names are tuples of Module names.
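# For example, an operator inside the layer may carry a name tuple like ('Model', 'layer', 'ExampleConvolutionLayer') (illustrative only; the exact contents depend on the module hierarchy), so checking for the class name selects exactly that layer's operations.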
if 'ExampleConvolutionLayer' in node.name: return True # Ignore all operations not in the example convolution layer return False model = Model() model.eval() inputs = torch.rand(1, 1, 3, 3) # Log output shows that `aten::_convolution` and `aten::add` are compiled # but `aten::mul` is not. This will seamlessly switch between Neuron/CPU # execution in a single graph. neuron_model = torch_neuron.trace( model, inputs, subgraph_builder_function=subgraph_builder_function ) Separate Weights ~~~~~~~~~~~~~~~~ This example uses the :code:`separate_weights` option to support compilation of models larger than 1.9GB. .. code-block:: python import torch import torch_neuron from torchvision import models # Load the model model = models.resnet50(pretrained=True) model.eval() # Compile with an example input image = torch.rand([1, 3, 224, 224]) # the model's output format does not change model_neuron = torch.neuron.trace(model, image, separate_weights=True) ================================================ FILE: archive/torch-neuron/api-core-placement.rst ================================================ .. _torch_core_placement_api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch Neuron (``torch-neuron``) Core Placement API ===================================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. automodule:: placement :module-name: torch_neuron.experimental :members: ================================================ FILE: archive/torch-neuron/api-reference-guide-torch-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 API Reference Guide (``torch-neuron``) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: PyTorch Neuron trace Python API torch.neuron.DataParallel API /archive/torch-neuron/api-core-placement .. include:: /archive/torch-neuron/api-reference-guide-torch-neuron.txt ================================================ FILE: archive/torch-neuron/api-reference-guide-torch-neuron.txt ================================================ * :ref:`PyTorch Neuron trace Python API ` * :ref:`torch.neuron.DataParallel API ` * :ref:`torch_core_placement_api` ================================================ FILE: archive/torch-neuron/api-torch-neuron-dataparallel-api.rst ================================================ .. _api_torch_neuron_dataparallel_api: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 torch.neuron.DataParallel API ============================= .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. The :func:`torch.neuron.DataParallel` Python API implements data parallelism on :class:`~torch.jit.ScriptModule` models created by the :ref:`torch_neuron_trace_api`.
This function is analogous to :class:`~torch.nn.DataParallel` in PyTorch. The :ref:`torch-neuron-dataparallel-app-note` application note provides an overview of how :func:`torch.neuron.DataParallel` can be used to improve the performance of inference workloads on Inferentia. .. py:function:: torch.neuron.DataParallel(model, device_ids=None, dim=0) Applies data parallelism by replicating the model on available NeuronCores and distributing data across the different NeuronCores for parallelized inference. By default, DataParallel will use all available NeuronCores allocated for the current process for parallelism. DataParallel will apply parallelism on ``dim=0`` if ``dim`` is not specified. DataParallel automatically enables :ref:`dynamic batching ` on eligible models if ``dim=0``. Dynamic batching can be disabled using :func:`torch.neuron.DataParallel.disable_dynamic_batching`. If dynamic batching is not enabled, the batch size at compilation-time must be equal to the batch size at inference-time divided by the number of NeuronCores being used. Specifically, the following must be true when dynamic batching is disabled: ``input.shape[dim] / len(device_ids) == compilation_input.shape[dim]``. DataParallel will throw a warning if dynamic batching cannot be enabled. DataParallel will try to load all of a model's NEFFs onto a single NeuronCore only if all of the NEFFs can fit on a single NeuronCore. DataParallel does not currently support models that have been compiled with :ref:`neuroncore-pipeline`. :func:`torch.neuron.DataParallel` requires PyTorch >= 1.8. *Required Arguments* :arg ~torch.jit.ScriptModule model: Model created by the :ref:`torch_neuron_trace_api` to be parallelized. *Optional Arguments* :arg list device_ids: List of :obj:`int` or ``'nc:#'`` that specify the NeuronCores to use for parallelization (default: all NeuronCores). Refer to the :ref:`device_ids note <device_ids_note>` for a description of how ``device_ids`` indexing works. :arg int dim: Dimension along which the input tensor is scattered across NeuronCores (default ``dim=0``). *Attributes* :arg int num_workers: Number of worker threads used for multithreaded inference (default: ``2 * number of NeuronCores``). :arg int split_size: Size of the input chunks (default: ``max(1, input.shape[dim] // number of NeuronCores)``). .. py:function:: torch.neuron.DataParallel.disable_dynamic_batching() Disables automatic dynamic batching on the DataParallel module. See :ref:`Dynamic batching disabled <dataparallel_example_disable_dynamic_batching_api>` for an example of how DataParallel can be used with dynamic batching disabled. Use as follows: >>> model_parallel = torch.neuron.DataParallel(model_neuron) >>> model_parallel.disable_dynamic_batching() .. _device_ids_note: .. note:: ``device_ids`` uses per-process NeuronCore granularity and zero-based indexing. Per-process granularity means that each Python process "sees" its own view of the world. Specifically, this means that ``device_ids`` only "sees" the NeuronCores that are allocated for the current process. Zero-based indexing means that each Python process will index its allocated NeuronCores starting at 0, regardless of the "global" index of the NeuronCores. Zero-based indexing makes it possible to redeploy the exact same code unchanged in different processes. This behavior is analogous to the ``device_ids`` argument in the PyTorch :class:`~torch.nn.DataParallel` function.
As an example, assume DataParallel is run on an inf1.6xlarge, which contains four Inferentia chips each of which contains four NeuronCores: * If ``NEURON_RT_VISIBLE_CORES`` is not set, a single process can access all 16 NeuronCores. Thus specifying ``device_ids=["nc:0"]`` will correspond to chip0:core0 and ``device_ids=["nc:14"]`` will correspond to chip3:core2. * However, if two processes are launched where: process 1 has ``NEURON_RT_VISIBLE_CORES=0-6`` and process 2 has ``NEURON_RT_VISIBLE_CORES=7-15``, ``device_ids=["nc:14"]`` cannot be specified in either process. Instead, chip3:core2 can only be accessed in process 2. Additionally, chip3:core2 is specified in process 2 with ``device_ids=["nc:7"]``. Furthermore, in process 1, ``device_ids=["nc:0"]`` would correspond to chip0:core0; in process 2 ``device_ids=["nc:0"]`` would correspond to chip1:core3. Examples -------- The following sections provide example usages of the :func:`torch.neuron.DataParallel` module. Default usage ^^^^^^^^^^^^^ .. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-default.rst Specifying NeuronCores ^^^^^^^^^^^^^^^^^^^^^^ .. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-specify-ncs.rst DataParallel with dim != 0 ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-dim-neq-zero.rst Dynamic batching ^^^^^^^^^^^^^^^^ .. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-dynamic-batching.rst .. _dataparallel_example_disable_dynamic_batching_api: Dynamic batching disabled ^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /archive/torch-neuron/torch-neuron-dataparallel-example-disable-dynamic-batching.rst Full tutorial with torch.neuron.DataParallel ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For an end-to-end tutorial that uses DataParallel, see the :ref:`PyTorch Resnet Tutorial `. ================================================ FILE: archive/torch-neuron/developer-guide-torch-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Developer Guide (``torch-neuron``) ================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: Running Inference on Variable Input Shapes with Bucketing Data Parallel Inference on PyTorch Neuron /archive/torch-neuron/guides/torch-lstm-support /archive/torch-neuron/guides/core-placement/torch-core-placement .. include:: /archive/torch-neuron/developer-guide-torch-neuron.txt ================================================ FILE: archive/torch-neuron/developer-guide-torch-neuron.txt ================================================ * :ref:`Running Inference on Variable Input Shapes with Bucketing ` * :ref:`Data Parallel Inference on PyTorch Neuron ` * :ref:`torch_neuron_lstm_support` * :ref:`torch_neuron_core_placement_guide` ================================================ FILE: archive/torch-neuron/guides/core-placement/torch-core-placement.rst ================================================ .. _torch_neuron_core_placement_guide: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch Neuron (``torch-neuron``) Core Placement ================================================ .. 
warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. This programming guide describes the available techniques and APIs for allocating NeuronCores to a process and placing models onto specific NeuronCores. In order of precedence, the current recommendation is to use the following placement techniques: 1. For most regular models, default core placement should be used in conjunction with ``NEURON_RT_NUM_CORES`` (:ref:`torch_placement_default`). 2. For more specific core placement of NeuronCore Pipelined models, ``NEURONCORE_GROUP_SIZES`` should be used (:ref:`torch_placement_ncg`). 3. Finally, for even more granular control, the beta explicit placement APIs may be used (:ref:`torch_placement_explicit`). .. contents:: Table of Contents :depth: 3 The following guide will assume a machine with 8 NeuronCores: - NeuronCores will use the notation ``nc0``, ``nc1``, etc. - NeuronCore Groups will use the notation ``ncg0``, ``ncg1`` etc. - Models will use the notation ``m0``, ``m1`` etc. NeuronCores, NeuronCore Groups, and model allocations will be displayed in the following format: .. raw:: html :file: images/0-0-legend.svg Note that the actual cores that are visible to the process can be adjusted according to the :ref:`nrt-configuration`. NeuronCore Pipeline ------------------- A key concept for understanding the intent behind certain core placement strategies is NeuronCore Pipelining (see :ref:`neuroncore-pipeline`). NeuronCore Pipelining allows a model to be automatically split into pieces and executed on different NeuronCores. For most models, only 1 NeuronCore will be required for execution. A model will **only** require more than one NeuronCore when using NeuronCore Pipeline. When model pipelining is enabled, the model is split between multiple NeuronCores and data is transferred between them. For example, if the compiler flag ``--neuroncore-pipeline-cores 4`` is used, this splits the model into 4 pieces to be executed on 4 separate NeuronCores. .. _torch_placement_default: Default Core Allocation & Placement ----------------------------------- The most basic requirement of an inference application is to be able to place a single model on a single NeuronCore. More complex applications may use multiple NeuronCores or even multiple processes each executing different models. The important thing to note about designing an inference application is that a single NeuronCore will always be allocated to a single process. *Processes do not share NeuronCores*. Different configurations can be used to ensure that an application process has enough NeuronCores allocated to execute its model(s): - Default: A process will attempt to take ownership of **all NeuronCores** visible on the instance. This should be used when an instance is only running a single inference process since no other process will be allowed to take ownership of any NeuronCores. - ``NEURON_RT_NUM_CORES``: Specifies the **number of NeuronCores** to allocate to the process. This places no restrictions on which NeuronCores will be used; however, the resulting NeuronCores will always be contiguous. This should be used in multi-process applications where each process should only use a subset of NeuronCores. - ``NEURON_RT_VISIBLE_CORES``: Specifies exactly **which NeuronCores** are allocated to the process by index.
Similar to ``NEURON_RT_NUM_CORES``, this can be used in multi-process applications where each process should only use a subset of NeuronCores. This provides more fine-grained control over the exact NeuronCores that are allocated to a given process. - ``NEURONCORE_GROUP_SIZES``: Specifies a number of **NeuronCore Groups** which are allocated to the process. This is described in more detail in the :ref:`torch_placement_ncg` section. See the :ref:`nrt-configuration` for more environment variable details. Example: Default ^^^^^^^^^^^^^^^^ **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1 .. raw:: html :file: images/0-1-default-2.svg With no environment configuration, the process will take ownership of all NeuronCores. In this example, only two of the NeuronCores are used by the process and the remaining are allocated but left idle. Example: ``NEURON_RT_NUM_CORES`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Environment Setup**: .. code-block:: bash export NEURON_RT_NUM_CORES='2' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1 .. raw:: html :file: images/0-2-default-rt-num-cores.svg Since there is no other process on the instance, only the first 2 NeuronCores will be acquired by the process. Models load in a simple linear order to the least used NeuronCores. Example: ``NEURON_RT_VISIBLE_CORES`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Environment Setup**: .. code-block:: bash export NEURON_RT_VISIBLE_CORES='4-5' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc4 m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc5 .. raw:: html :file: images/0-3-default-rt-visible-cores.svg Unlike ``NEURON_RT_NUM_CORES``, setting the visible NeuronCores allows the process to take control of a specific contiguous set. This allows an application to have more fine-grained control over where models will be placed. Example: Overlapping Models ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Environment Setup**: .. code-block:: bash export NEURON_RT_VISIBLE_CORES='0-1' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1 m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1 .. raw:: html :file: images/0-4-default-overlap-model-2.svg .. raw:: html :file: images/0-4-default-overlap.svg This shows how models may share NeuronCores, but the default model placement will attempt to evenly distribute NeuronCore usage rather than overlapping all models on a single NeuronCore. Example: Multiple Processes ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Environment Setup**: .. code-block:: bash export NEURON_RT_NUM_CORES='2' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1 In this example, if the script is run **twice**, the following allocations will be made: ..
raw:: html :file: images/0-5-default-multiprocess.svg Note that each process will take ownership of as many NeuronCores as is specified by the ``NEURON_RT_NUM_CORES`` configuration. .. _torch_placement_ncg: NEURONCORE_GROUP_SIZES ---------------------- .. important:: Explicit core placement should only be used when a specific performance goal is required. By default, ``torch-neuron`` places models on the **least used** NeuronCores. This should be optimal for most applications. Additionally, ``NEURONCORE_GROUP_SIZES`` will be deprecated in a future release and should be avoided in favor of newer placement methods. Use ``NEURON_RT_NUM_CORES`` or ``NEURON_RT_VISIBLE_CORES`` with default placement if possible (see :ref:`torch_placement_default`). In the current release of the Neuron SDK, the most well-supported method of placing models onto specific NeuronCores is to use the ``NEURONCORE_GROUP_SIZES`` environment variable. This will define a set of "NeuronCore Groups" for the application process. NeuronCore Groups are *contiguous sets of NeuronCores* that are allocated to a given process. Creating groups allows an application to ensure that a model has a defined set of NeuronCores that will always be allocated to it. Note that NeuronCore Groups *can* be used to allocate non-pipelined models (those requiring exactly 1 NeuronCore) to specific NeuronCores, but this is not the primary intended use. The intended use of NeuronCore Groups is to ensure pipelined models (those requiring >1 NeuronCore) have exclusive access to a specific set of contiguous NeuronCores. In the cases where models are being used *without* NeuronCore Pipeline, the general recommendation is to use default placement (see :ref:`torch_placement_default`). The following section demonstrates how ``NEURONCORE_GROUP_SIZES`` can be used and the issues that may arise. Example: Single NeuronCore Group ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In the example where one model requires 4 NeuronCores, the correct environment configuration would be: **Environment Setup**: .. code-block:: bash export NEURONCORE_GROUP_SIZES='4' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-4-neuron-pipeline-cores.pt') # Loads to nc0-nc3 .. raw:: html :file: images/1-ncg-4.svg This is the most basic usage of a NeuronCore Group. The environment setup causes the process to take control of 4 NeuronCores and then the script loads a model compiled with a NeuronCore Pipeline size of 4 to the first group. Example: Multiple NeuronCore Groups ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ With more complicated configurations, the intended use of ``NEURONCORE_GROUP_SIZES`` is to create 1 Group per model with the correct size to ensure that the models are placed on the intended NeuronCores. Similarly, the environment would need to be configured to create a NeuronCore Group for each model: **Environment Setup**: .. code-block:: bash export NEURONCORE_GROUP_SIZES='3,4,1' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2 m1 = torch.jit.load('model-with-4-neuron-pipeline-cores.pt') # Loads to nc3-nc6 m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc7 .. raw:: html :file: images/2-ncg-3-4-1.svg Issue: Overlapping Models with Differing Model Sizes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When multiple models are loaded to a single NeuronCore Group, this can cause unintended inefficiencies.
A single model is only intended to span a single NeuronCore Group. Applications with many models of varying sizes can be restricted by NeuronCore Group configurations since the most optimal model layout may require more fine-grained control. **Environment Setup**: .. code-block:: bash export NEURONCORE_GROUP_SIZES='2,2' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1 m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3 m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 m3 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc2 m4 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0 .. raw:: html :file: images/3-models-m4-0-warning.svg .. raw:: html :file: images/3-models-m2-0-m3-2.svg .. raw:: html :file: images/3-ncg-2-2.svg Here, ``NEURONCORE_GROUP_SIZES`` does not generate an optimal layout because placement strictly follows the layout of NeuronCore Groups. A potentially more optimal layout would be to place ``m4`` onto ``nc1``. In this case, since a pipelined model will not be able to have exclusive access to a set of NeuronCores, the default NeuronCore placement (no NeuronCore Groups specified) would more evenly distribute the models. Also note here that this is an example of where the order of model loads affects which model is assigned to which NeuronCore Group. If the order of the load statements is changed, models may be assigned to different NeuronCore Groups. Issue: Incompatible Model Sizes ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Another problem occurs when attempting to place a model which does not evenly fit into a single group: **Environment Setup**: .. code-block:: bash export NEURONCORE_GROUP_SIZES='2,2' **Python Script**: .. code-block:: python import torch import torch_neuron m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1 m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3 m2 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2 .. raw:: html :file: images/4-models-m2-0-2-warning.svg .. raw:: html :file: images/3-ncg-2-2.svg The model will be placed *across* NeuronCore Groups since there is no obvious group to assign the model to according to the environment variable configuration. Depending on the individual model and application requirements, the placement here may not be optimal. Issue: Multiple Model Copies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It is common in inference serving applications to use multiple replicas of a single model across different NeuronCores. This allows the hardware to be fully utilized to maximize throughput. In this scenario, when using NeuronCore Groups, the only way to replicate a model on multiple NeuronCores is to create a *new model* object. In the example below, 4 model loads are performed to place a model in each NeuronCore Group. **Environment Setup**: .. code-block:: bash export NEURONCORE_GROUP_SIZES='2,2,2,2' **Python Script**: .. code-block:: python import torch import torch_neuron models = list() for _ in range(4): model = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') models.append(model) .. raw:: html :file: images/3-ncg-2-2-2-2-copies.svg The largest consequence of this type of model allocation is that the application code is responsible for routing inference requests to models.
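As an illustration only, a minimal round-robin dispatcher over the ``models`` list built above might look like the following sketch. The ``dispatch`` helper is hypothetical application code, not a ``torch-neuron`` API:

.. code-block:: python

    import itertools
    import threading

    # Cycle over the replica handles created above (one per NeuronCore Group)
    replicas = itertools.cycle(models)
    lock = threading.Lock()

    def dispatch(batch):
        # Hypothetical application-level routing: selecting the next replica
        # is serialized, while the executions themselves may run concurrently
        with lock:
            model = next(replicas)
        return model(batch)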
There are a variety of ways to implement the inference switching, but in all cases the routing logic needs to be implemented in the application code. Issue Summary ^^^^^^^^^^^^^ The use of ``NEURONCORE_GROUP_SIZES`` has the following problems: - **Variable Sized Models**: Models which require crossing NeuronCore Group boundaries may be placed poorly. This means the group configuration limits the sizes of models that can be loaded. - **Model Load Order**: Models are loaded to NeuronCore Groups greedily. This means that the order of model loads can potentially negatively affect application performance by causing unintentional overlap. - **Implicit Placement**: NeuronCore Groups cannot be explicitly chosen in the application code. - **Manual Replication**: When loading multiple copies of a model to different NeuronCore Groups, this requires that multiple model handles are used. .. _torch_placement_explicit: Explicit Core Placement ------------------------------------- To address the limitations of ``NEURONCORE_GROUP_SIZES``, a new set of APIs has been added which allows specific NeuronCores to be chosen by the application code. These can be found in the :ref:`torch_neuron_core_placement_api` documentation. Example: Manual Core Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The most direct usage of the placement APIs is to manually select the start NeuronCore that each model is loaded to. This will automatically use as many NeuronCores as is necessary for that model (1 for most models, >1 for NeuronCore Pipeline models). **Environment Setup**: .. code-block:: bash export NEURON_RT_NUM_CORES='4' **Python Script**: .. code-block:: python import torch import torch_neuron # NOTE: Order of loads does NOT matter with torch_neuron.experimental.neuron_cores_context(2): m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3 with torch_neuron.experimental.neuron_cores_context(0): m2 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2 with torch_neuron.experimental.neuron_cores_context(0): m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1 with torch_neuron.experimental.neuron_cores_context(3): m3 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc3 .. raw:: html :file: images/5-models-m2-0-2-m3-3.svg .. raw:: html :file: images/5-placement.svg Note that this directly solves the ``NEURONCORE_GROUP_SIZES`` issues of: - **Variable Sized Models**: Now since models are directly placed on the NeuronCores requested by the application, there is no disconnect between the model sizes and NeuronCore Group sizes. - **Model Load Order**: Since the NeuronCores are explicitly selected, there is no need to be careful about the order in which models are loaded since they can be placed deterministically regardless of the load order. - **Implicit Placement**: Similarly, explicit placement means there is no chance that a model will end up being allocated to an incorrect NeuronCore Group. Example: Automatic Multicore ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Using explicit core placement, it is possible to replicate a model to multiple NeuronCores simultaneously. This means that a single model object within Python can utilize all available NeuronCores (or NeuronCores allocated to the process). **Environment Setup**: .. code-block:: bash export NEURON_RT_NUM_CORES='8' **Python Script**: ..
code-block:: python import torch import torch_neuron with torch_neuron.experimental.multicore_context(): m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads replications to nc0-nc7 .. raw:: html :file: images/6-multicore.svg This addresses the last ``NEURONCORE_GROUP_SIZES`` issue of: - **Manual Replication**: Since models can be automatically replicated to multiple NeuronCores, this means that applications no longer need to implement routing logic and perform multiple loads. This API has the secondary benefit that the exact same loading logic can be used on an ``inf1.xlarge`` or an ``inf1.6xlarge``. In either case, it will use all of the NeuronCores that are visible to the process. This means that no special logic needs to be coded for different instance types. Example: Explicit Replication ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Replication is also possible with the :func:`~torch_neuron.experimental.neuron_cores_context` API. The number of replications is chosen by ``replications = floor(nc_count / cores_per_model)``. **Environment Setup**: .. code-block:: bash export NEURON_RT_NUM_CORES='8' **Python Script**: .. code-block:: python import torch import torch_neuron with torch_neuron.experimental.neuron_cores_context(start_nc=2, nc_count=4): m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads replications to nc2-nc5 .. raw:: html :file: images/7-replication.svg ================================================ FILE: archive/torch-neuron/guides/torch-lstm-support.rst ================================================ .. _torch_neuron_lstm_support: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Developer Guide - PyTorch Neuron (``torch-neuron``) |LSTM| Support ================================================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. The ``torch-neuron`` package can support |LSTM| operations and yield high performance on both fixed-length and variable-length sequences. Most network configurations can be supported, with the exception of those that require |PackedSequence| usage outside of |LSTM| or |pad_packed_sequence| operations. Neuron must guarantee that the shapes can remain fixed throughout the network. The following sections describe which scenarios can and cannot be supported. Supported Usage --------------- Fixed-Length Sequences ~~~~~~~~~~~~~~~~~~~~~~ In normal usage of an |LSTM|, the inputs and outputs are expected to have a fixed sequence length. This is the most basic usage of an |LSTM| but may not be applicable to applications where the input sequence length may vary. .. code-block:: python import torch import torch_neuron class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs): output, (ht, ct) = self.lstm(inputs) return output, (ht, ct) # Example Inputs seq_len, batch_size, input_size = 5, 2, 3 inputs = torch.rand(seq_len, batch_size, input_size) # Trace torch_neuron.trace(Network(), (inputs,)) Packed Input, Padded Output, *Pre-Sorted* Inputs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A common usage of an |LSTM| is when the input sequence sizes vary according to the input sequence lengths (such as tokenized text).
For example, the following sentences could result in two different sequence lengths after tokenization: .. code-block:: python # Input text = [ 'Hello, sailor', 'Example', ] # ... Tokenization ... # Result tokens = [ [101, 7592, 1010, 11803, 102], [101, 2742, 102, 0, 0], ] lengths = [5, 3] Because the lengths are different, the final |LSTM| state will be dependent upon the lengths of each sequence in the batch. Torch provides a way to deal with these types of sequences by densely packing batches into a |PackedSequence|. The most common way this is constructed is by using the |pack_padded_sequence| utility function prior to feeding inputs into the |LSTM|. Packing the above sequences would result in the following data and batch size tensors. .. code-block:: python data = [101, 101, 7592, 2742, 1010, 102, 11803, 102] batch_sizes = [2, 2, 2, 1, 1] In addition to correctly computing the final |LSTM| state, using a packed sequence instead of a padded sequence also improves model performance on CPU. On Neuron, where computation is fixed to the maximum length ahead of time, **this does not improve performance**. When an |LSTM| is processing a |PackedSequence|, it must do so in descending sorted-length order. To ensure that sequences are sorted, |pack_padded_sequence| provides an ``enforce_sorted`` flag. When ``enforce_sorted`` is ``True``, the input is *already expected* to contain sequences sorted by length in a decreasing order along the batch dimension. Note that this must be enforced in the application-level code but is only relevant when batch size > 1. The following network can compile successfully because the input and output to the network are guaranteed to have a fixed shape. The input shape is expected to be a padded tensor and the output tensor is expected to be padded to the maximum sequence length using the |pad_packed_sequence| function call: .. code-block:: python :emphasize-lines: 14 import torch import torch_neuron class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs, lengths): packed_input = torch.nn.utils.rnn.pack_padded_sequence( inputs, lengths=lengths, enforce_sorted=True, ) packed_result, (ht, ct) = self.lstm(packed_input) padded_result, _ = torch.nn.utils.rnn.pad_packed_sequence(packed_result) return padded_result, ht, ct # Example Inputs seq_len, batch_size, input_size = 5, 2, 3 inputs = torch.rand(seq_len, batch_size, input_size) lengths = torch.tensor([seq_len] * batch_size) # Trace torch_neuron.trace(Network(), (inputs, lengths)) Packed Input, Padded Output, *Unsorted* Inputs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When ``enforce_sorted`` is ``False``, the input will be sorted unconditionally. This causes some CPU overhead on Neuron because unsupported operators will be inserted into the graph, such as ``aten::sort`` and ``aten::scatter_``. The ``aten::lstm`` operation can still be supported, but it will be less efficient than when ``enforce_sorted`` is ``True``. The following code is able to be traced, but results in the sorting operations running on CPU. This is not problematic in this case because the ``aten::sort`` and ``aten::scatter_`` are executed on CPU at the very beginning of the graph just prior to Neuron execution. Like the previous example, the call to |pad_packed_sequence| ensures that the output has a fixed shape based on the maximum sequence length. ..
code-block:: python :emphasize-lines: 14 import torch import torch_neuron class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs, lengths): packed_input = torch.nn.utils.rnn.pack_padded_sequence( inputs, lengths=lengths, enforce_sorted=False, ) packed_result, (ht, ct) = self.lstm(packed_input) padded_result, _ = torch.nn.utils.rnn.pad_packed_sequence(packed_result) return padded_result, ht, ct # Example Inputs seq_len, batch_size, input_size = 5, 2, 3 inputs = torch.rand(seq_len, batch_size, input_size) lengths = torch.tensor([seq_len] * batch_size) # Trace trace = torch_neuron.trace(Network(), (inputs, lengths)) Packed Inputs, Final Hidden & Cell State Only ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When **only** the final |LSTM| hidden & cell state is used, it does not matter if the inputs are packed or unpacked since these state tensors will not vary in size. .. code-block:: python :emphasize-lines: 16,17 import torch import torch_neuron class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs, lengths): packed_input = torch.nn.utils.rnn.pack_padded_sequence( inputs, lengths=lengths, enforce_sorted=True, ) packed_output, (ht, ct) = self.lstm(packed_input) return ht, ct # Example Inputs seq_len, batch_size, input_size = 5, 2, 3 inputs = torch.rand(seq_len, batch_size, input_size) lengths = torch.tensor([seq_len] * batch_size) # Trace trace = torch_neuron.trace(Network(), (inputs, lengths)) Note that when the ``packed_output`` is unused, it does not need to be passed to |pad_packed_sequence| to enable the |LSTM| to be compiled. Unsupported Usage ----------------- Neuron does not support the use of a |PackedSequence| outside of the |LSTM| operation and the |pad_packed_sequence| operation. This is because the shape of a |PackedSequence| can vary depending on the input data. This is incompatible with the Neuron restriction that all tensor sizes must be known at compilation time. When a |PackedSequence| is used only by an |LSTM| or |pad_packed_sequence| operation, Neuron *can guarantee* the size of the intermediary tensors by padding on behalf of the application. This means that if the |PackedSequence| is used by a different operation or returned from the network, either all of the |LSTM| operations will be executed on CPU or the network compilation will fail. |PackedSequence| Returned ~~~~~~~~~~~~~~~~~~~~~~~~~ The following is unsupported because the |PackedSequence| result of the |LSTM| is returned by the network: .. code-block:: python :emphasize-lines: 14 class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs, lengths): packed_input = torch.nn.utils.rnn.pack_padded_sequence( inputs, lengths=lengths, enforce_sorted=False, ) packed_result, (ht, ct) = self.lstm(packed_input) return packed_result.data, ht, ct **Behavior**: In this case, compilation fails and the following warning is generated: .. code-block:: text Operator "aten::lstm" consuming a PackedSequence input can only be supported when its corresponding PackedSequence output is unused or unpacked using "aten::_pad_packed_input".
Found usage by "prim::Return" **Resolution**: To avoid this error, the ``packed_result`` should be padded prior to being returned from the network by using |pad_packed_sequence|. Invalid |PackedSequence| Usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following is unsupported because the |PackedSequence| result of the |LSTM| is used by a non-LSTM operator: .. code-block:: python :emphasize-lines: 14 class Network(torch.nn.Module): def __init__(self): super().__init__() self.lstm = torch.nn.LSTM(input_size=3, hidden_size=7) def forward(self, inputs, lengths): packed_input = torch.nn.utils.rnn.pack_padded_sequence( inputs, lengths=lengths, enforce_sorted=False, ) packed_result, (ht, ct) = self.lstm(packed_input) return torch.max(packed_result.data) **Behavior**: In this case, compilation fails and the following warning is generated: .. code-block:: text Operator "aten::lstm" consuming a PackedSequence input can only be supported when its corresponding PackedSequence output is unused or unpacked using "aten::_pad_packed_input". Found usage by "aten::max" **Resolution**: To avoid this error, the ``packed_result`` should be padded prior to being used in :func:`~torch.max` by using |pad_packed_sequence|. .. |LSTM| replace:: :class:`~torch.nn.LSTM` .. |PackedSequence| replace:: :class:`~torch.nn.utils.rnn.PackedSequence` .. |pack_padded_sequence| replace:: :func:`~torch.nn.utils.rnn.pack_padded_sequence` .. |pad_packed_sequence| replace:: :func:`~torch.nn.utils.rnn.pad_packed_sequence` ================================================ FILE: archive/torch-neuron/index.rst ================================================ .. _torch-neuron-main: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch Neuron (torch-neuron) — Archived ========================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer actively developed. For new workloads, use TorchNeuron Native or torch-neuronx. See :doc:`/frameworks/torch/index` for current PyTorch support. PyTorch Neuron (``torch-neuron``) was the original PyTorch integration for AWS Inferentia (Inf1) instances. This package supported inference workloads on the NeuronCore v1 architecture. .. contents:: Table of contents :local: :depth: 2 API Reference ------------- .. toctree:: :maxdepth: 1 api-reference-guide-torch-neuron api-compilation-python-api api-core-placement api-torch-neuron-dataparallel-api Developer Guide --------------- .. toctree:: :maxdepth: 1 developer-guide-torch-neuron troubleshooting-guide Tutorials --------- .. toctree:: :maxdepth: 1 tutorials/tutorials-inference-torch-neuron Setup ----- .. toctree:: :maxdepth: 1 setup/pytorch-install setup/pytorch-update Misc ---- .. toctree:: :maxdepth: 1 additional-examples-inference-torch-neuron misc-inference-torch-neuron ================================================ FILE: archive/torch-neuron/inference-torch-neuron.rst ================================================ .. _inference-torch-neuron: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-13 Inference with ``torch-neuron`` (Inf1) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer actively developed. For new workloads, use TorchNeuron Native or torch-neuronx. See :doc:`/frameworks/torch/index` for current PyTorch support. ..
toctree:: :maxdepth: 1 :hidden: Tutorials Additional Examples API Reference Guide Developer Guide Misc .. card:: Setup (``torch-neuron``) :link: setup-torch-neuron :link-type: ref :class-body: sphinx-design-class-title-small .. dropdown:: Tutorials (``torch-neuron``) :class-title: sphinx-design-class-title-small :animate: fade-in :name: torch-neuronx-training-tutorials .. include:: /archive/torch-neuron/tutorials/tutorials-inference-torch-neuron.txt .. dropdown:: Additional Examples (``torch-neuron``) :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /archive/torch-neuron/additional-examples-inference-torch-neuron.txt .. dropdown:: API Reference Guide (``torch-neuron``) :class-title: sphinx-design-class-title-small :animate: fade-in .. include:: /archive/torch-neuron/api-reference-guide-torch-neuron.txt .. dropdown:: Developer Guide (``torch-neuron``) :class-title: sphinx-design-class-title-small :animate: fade-in .. include:: /archive/torch-neuron/developer-guide-torch-neuron.txt .. dropdown:: Misc (``torch-neuron``) :class-title: sphinx-design-class-title-small :animate: fade-in * :ref:`neuron-cc-ops-pytorch` * :ref:`pytorch-neuron-inference-troubleshooting` * :ref:`pytorch-neuron-rn` ================================================ FILE: archive/torch-neuron/misc-inference-torch-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Misc (``torch-neuron``) ======================= .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: /release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-pytorch /archive/torch-neuron/troubleshooting-guide /release-notes/components/pytorch .. include:: /archive/torch-neuron/misc-inference-torch-neuron.txt ================================================ FILE: archive/torch-neuron/misc-inference-torch-neuron.txt ================================================ * :ref:`neuron-cc-ops-pytorch` * :ref:`pytorch-neuron-inference-troubleshooting` * :ref:`pytorch-neuron-rn` ================================================ FILE: archive/torch-neuron/placement.py ================================================ """ .. warning:: The following functionality is beta and **will not be supported** in future releases of the Neuron SDK. This module serves only as a preview for future functionality. In future releases, equivalent functionality may be moved directly to the :code:`torch_neuron` module and will no longer be available in the :code:`torch_neuron.experimental` module. Functions which enable placement of :class:`torch.jit.ScriptModule` to specific NeuronCores. Two sets of functions are provided which can be used interchangeably but have different performance characteristics and advantages: - The :func:`~torch_neuron.experimental.multicore_context` & :func:`~torch_neuron.experimental.neuron_cores_context` functions are context managers that allow a model to be placed on a given NeuronCore at :func:`torch.jit.load` time. These functions are the most efficient way of loading a model since the model is loaded directly to a NeuronCore. The alternative functions described below require that a model is unloaded from one core and then reloaded to another. 
- The :func:`~torch_neuron.experimental.set_multicore` & :func:`~torch_neuron.experimental.set_neuron_cores` functions allow a model that has already been loaded to a NeuronCore to be moved to a different NeuronCore. This functionality is less efficient than directly loading a model to a NeuronCore within a context manager but allows device placement to be fully dynamic at runtime. This is analogous to the :meth:`torch.nn.Module.to` function for device placement. .. important:: A prerequisite to enable placement functionality is that the loaded :class:`torch.jit.ScriptModule` has already been compiled with the :func:`torch_neuron.trace` API. Attempting to place a regular :class:`torch.nn.Module` onto a NeuronCore prior to compilation will do nothing. """ import contextlib def set_neuron_cores(trace: 'torch.jit.ScriptModule', start_nc: int=-1, nc_count: int=-1): """ Set the NeuronCore start/count for all Neuron subgraphs in a torch Module. This will unload the model from an existing NeuronCore if it is already loaded. *Requires Torch 1.8+* Arguments: trace: A torch module which contains one or more Neuron subgraphs. start_nc: The starting NeuronCore index where the Module is placed. The value ``-1`` automatically loads to the optimal NeuronCore (least used). Note that this index is always relative to NeuronCores visible to this process. nc_count: The number of NeuronCores to use. The value ``-1`` will load a model to exactly the number of cores required by that model (1 for most models, >1 when using NeuronCore Pipeline). If ``nc_count`` is greater than the number of NeuronCores required by the model, the model will be replicated across multiple NeuronCores. ``(replications = floor(nc_count / cores_per_model))`` Raises: RuntimeError: If the Neuron runtime cannot be initialized. ValueError: If the ``nc_count`` is an invalid number of NeuronCores. Examples: *Single Load*: Move a model to the first visible NeuronCore after loading. >>> model = torch.jit.load('example_neuron_model.pt') >>> torch_neuron.experimental.set_neuron_cores(model, start_nc=0, nc_count=1) >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 0 *Multiple Core Replication*: Replicate a model to 2 NeuronCores after loading. This allows a single :class:`torch.jit.ScriptModule` to use multiple NeuronCores by running round-robin executions. >>> model = torch.jit.load('example_neuron_model.pt') >>> torch_neuron.experimental.set_neuron_cores(model, start_nc=2, nc_count=2) >>> model(example) # Executes on NeuronCore 2 >>> model(example) # Executes on NeuronCore 3 >>> model(example) # Executes on NeuronCore 2 *Multiple Model Load*: Move and pin 2 models to separate NeuronCores. This causes each :class:`torch.jit.ScriptModule` to always execute on a specific NeuronCore. >>> model1 = torch.jit.load('example_neuron_model.pt') >>> torch_neuron.experimental.set_neuron_cores(model1, start_nc=2) >>> model2 = torch.jit.load('example_neuron_model.pt') >>> torch_neuron.experimental.set_neuron_cores(model2, start_nc=0) >>> model1(example) # Executes on NeuronCore 2 >>> model1(example) # Executes on NeuronCore 2 >>> model2(example) # Executes on NeuronCore 0 >>> model2(example) # Executes on NeuronCore 0 """ def set_multicore(trace: 'torch.jit.ScriptModule'): """ Loads all Neuron subgraphs in a torch Module to all visible NeuronCores. 
This loads each Neuron subgraph within a :class:`torch.jit.ScriptModule` to multiple NeuronCores without requiring multiple calls to :func:`torch.jit.load`. This allows a single :class:`torch.jit.ScriptModule` to use multiple NeuronCores for concurrent threadsafe inferences. Executions use a round-robin strategy to distribute across NeuronCores. This will unload the model from an existing NeuronCore if it is already loaded. *Requires Torch 1.8+* Arguments: trace: A torch module which contains one or more Neuron subgraphs. Raises: RuntimeError: If the Neuron runtime cannot be initialized. Examples: *Multiple Core Replication*: Move a model across all visible NeuronCores after loading. This allows a single :class:`torch.jit.ScriptModule` to use all NeuronCores by running round-robin executions. >>> model = torch.jit.load('example_neuron_model.pt') >>> torch_neuron.experimental.set_multicore(model) >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 1 >>> model(example) # Executes on NeuronCore 2 """ @contextlib.contextmanager def neuron_cores_context(start_nc: int=-1, nc_count: int=-1): """ A context which sets the NeuronCore start/count for all Neuron subgraphs. Any calls to :func:`torch.jit.load` will cause any underlying Neuron subgraphs to load to the specified NeuronCores within this context. This context manager only needs to be used during the model load. After loading, inferences do not need to occur in this context in order to use the correct NeuronCores. Note that this context is *not* threadsafe. Using multiple core placement contexts from multiple threads may not correctly place models. Arguments: start_nc: The starting NeuronCore index where the Module is placed. The value ``-1`` automatically loads to the optimal NeuronCore (least used). Note that this index is always relative to NeuronCores visible to this process. nc_count: The number of NeuronCores to use. The value ``-1`` will load a model to exactly the number of cores required by that model (1 for most models, >1 when using NeuronCore Pipeline). If ``nc_count`` is greater than the number of NeuronCores required by the model, the model will be replicated across multiple NeuronCores. ``(replications = floor(nc_count / cores_per_model))`` Raises: RuntimeError: If the Neuron runtime cannot be initialized. ValueError: If the ``nc_count`` is an invalid number of NeuronCores. Examples: *Single Load*: Directly load a model from disk to the first visible NeuronCore. >>> with torch_neuron.experimental.neuron_cores_context(start_nc=0, nc_count=1): >>> model = torch.jit.load('example_neuron_model.pt') >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 0 *Multiple Core Replication*: Directly load a model from disk to 2 NeuronCores. This allows a single :class:`torch.jit.ScriptModule` to use multiple NeuronCores by running round-robin executions. >>> with torch_neuron.experimental.neuron_cores_context(start_nc=2, nc_count=2): >>> model = torch.jit.load('example_neuron_model.pt') >>> model(example) # Executes on NeuronCore 2 >>> model(example) # Executes on NeuronCore 3 >>> model(example) # Executes on NeuronCore 2 *Multiple Model Load*: Directly load 2 models from disk and pin them to separate NeuronCores. This causes each :class:`torch.jit.ScriptModule` to always execute on a specific NeuronCore. 
>>> with torch_neuron.experimental.neuron_cores_context(start_nc=2): >>> model1 = torch.jit.load('example_neuron_model.pt') >>> with torch_neuron.experimental.neuron_cores_context(start_nc=0): >>> model2 = torch.jit.load('example_neuron_model.pt') >>> model1(example) # Executes on NeuronCore 2 >>> model1(example) # Executes on NeuronCore 2 >>> model2(example) # Executes on NeuronCore 0 >>> model2(example) # Executes on NeuronCore 0 """ @contextlib.contextmanager def multicore_context(): """ A context which loads all Neuron subgraphs to all visible NeuronCores. This loads each Neuron subgraph within a :class:`torch.jit.ScriptModule` to multiple NeuronCores without requiring multiple calls to :func:`torch.jit.load`. This allows a single :class:`torch.jit.ScriptModule` to use multiple NeuronCores for concurrent threadsafe inferences. Executions use a round-robin strategy to distribute across NeuronCores. Any calls to :func:`torch.jit.load` will cause any underlying Neuron subgraphs to load to the specified NeuronCores within this context. This context manager only needs to be used during the model load. After loading, inferences do not need to occur in this context in order to use the correct NeuronCores. Note that this context is *not* threadsafe. Using multiple core placement contexts from multiple threads may not correctly place models. Raises: RuntimeError: If the Neuron runtime cannot be initialized. Examples: *Multiple Core Replication*: Directly load a model to all visible NeuronCores. This allows a single :class:`torch.jit.ScriptModule` to use all NeuronCores by running round-robin executions. >>> with torch_neuron.experimental.multicore_context(): >>> model = torch.jit.load('example_neuron_model.pt') >>> model(example) # Executes on NeuronCore 0 >>> model(example) # Executes on NeuronCore 1 >>> model(example) # Executes on NeuronCore 2 """ ================================================ FILE: archive/torch-neuron/setup/index.rst ================================================ .. _setup-torch-neuron-archived: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Setup Guide for Inf1 ==================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 Fresh install Update to latest release Install previous releases /archive/torch-neuron/setup/pytorch-install-cxx11 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.14.2-pytorch-install.rst ================================================ .. _install-neuron-1.14.2-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (Neuron 1.14.2) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.14.2 --framework-version=pytorch-1.5.1 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.15.0-pytorch-install.rst ================================================ .. _install-neuron-1.15.0-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (Neuron 1.15.0) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.0 --framework-version=pytorch-1.5.1 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.15.1-pytorch-install.rst ================================================ .. _install-neuron-1.15.1-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (Neuron 1.15.1) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.1 --framework-version=pytorch-1.5.1 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.15.2-pytorch-install.rst ================================================ .. _install-neuron-1.15.2-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (Neuron 1.15.2) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.7.1 .. tab-item:: PyTorch 1.5.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.15.2 --framework-version=pytorch-1.5.1 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.16.1-pytorch-install.rst ================================================ .. _install-neuron-1.16.1-pytorch: .. 
meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 1.16.1)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.1 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.1 --framework-version=pytorch-1.5.1
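
.. note::
   The install commands on this page are generated at build time by the ``program-output`` directives above. If you are reading the RST source rather than the rendered page, a minimal sketch of how to reproduce one rendered block is to run the helper script from the repository root with the same flags the directive uses (script and manifest paths are the ones referenced above):

   .. code-block:: bash

      # Render the develop-mode instructions for Ubuntu (non-DLAMI) at Neuron 1.16.1.
      # Flags mirror the program-output directive above.
      python3 src/helperscripts/neuronsetuphelper.py \
          --file src/helperscripts/neuron-releases-manifest.json \
          --install pytorch --mode=develop \
          --ami=non-dlami --os=ubuntu --neuron-version=1.16.1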

================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-1.16.2-pytorch-install.rst
================================================

.. _install-neuron-1.16.2-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 1.16.2)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.2

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.2

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.2

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.2 --framework-version=pytorch-1.5.1
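
.. note::
   The rendered instructions for these archived Inf1 releases follow the standard pattern: packages are installed from the Neuron pip repository, with exact version pins taken from the release manifest. A representative sketch only; the authoritative pins for Neuron 1.16.2 are whatever the ``program-output`` blocks above rendered, not this example:

   .. code-block:: bash

      # Representative Inf1 PyTorch install pattern (pins come from the manifest,
      # not from this sketch).
      pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
      pip install torch-neuron neuron-cc[tensorflow] torchvision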

================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-1.16.3-pytorch-install.rst
================================================

.. _install-neuron-1.16.3-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 1.16.3)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.16.3 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.16.3 --framework-version=pytorch-1.5.1
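
.. note::
   Each page in this archive drives the same helper script with a different ``--neuron-version``; the set of valid releases and framework pins lives in the JSON manifest referenced by ``--file``. A minimal sketch for inspecting which entries the manifest exposes, without assuming anything about its schema:

   .. code-block:: bash

      # List the top-level keys of the release manifest (schema not assumed here).
      python3 -c "import json; print(list(json.load(open('src/helperscripts/neuron-releases-manifest.json'))))"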

================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-1.17.2-pytorch-install.rst
================================================

.. _install-neuron-1.17.2-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 1.17.2)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            ..
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.17.2 --framework-version=pytorch-1.5.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.17.2 --framework-version=pytorch-1.5.1 ================================================ FILE: archive/torch-neuron/setup/prev-releases/neuron-1.18.0-pytorch-install.rst ================================================ .. _install-neuron-1.18.0-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (Neuron 1.18.0) ====================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: PyTorch 1.10.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 .. tab-item:: PyTorch 1.9.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
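The install instructions in the tabs below are generated at documentation build time: each ``program-output`` directive runs the ``neuronsetuphelper.py`` helper against the Neuron release manifest and captures its console output. As a rough sketch (assuming the command is run from the repository root, where the helper and manifest paths resolve), the same output can be reproduced manually:

.. code-block:: bash

   # Sketch: render the develop-mode install steps for Neuron 1.18.0 on a
   # non-DLAMI Ubuntu instance. Paths are relative to the repository root.
   python3 src/helperscripts/neuronsetuphelper.py \
       --file src/helperscripts/neuron-releases-manifest.json \
       --install pytorch \
       --mode=develop \
       --ami=non-dlami \
       --os=ubuntu \
       --neuron-version=1.18.0

   # To pin an older framework release, add --framework-version, for example:
   #   --framework-version=pytorch-1.9.1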
.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.10.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.7.1

   .. tab-item:: PyTorch 1.5.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.18.0 --framework-version=pytorch-1.5.1

================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-1.19.0-pytorch-install.rst
================================================

.. _install-neuron-1.19.0-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 1.19.0)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst
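Each pair of tabs below corresponds to one combination of ``--ami`` (``non-dlami`` or ``dlami``) and ``--os`` (``ubuntu`` or ``amazonlinux``) passed to the helper. A minimal sketch of iterating that matrix for Neuron 1.19.0 (flag values taken from the directives below; assumed to run from the repository root):

.. code-block:: bash

   # Sketch: enumerate the AMI/OS combinations behind the tabs on this page.
   for ami in non-dlami dlami; do
       for os in ubuntu amazonlinux; do
           python3 src/helperscripts/neuronsetuphelper.py \
               --file src/helperscripts/neuron-releases-manifest.json \
               --install pytorch --mode=develop \
               --ami=$ami --os=$os --neuron-version=1.19.0
       done
   done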
.. tab-set::

   .. tab-item:: PyTorch 1.11.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: PyTorch 1.10.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

   .. tab-item:: PyTorch 1.11.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: PyTorch 1.10.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

   .. tab-item:: PyTorch 1.11.0

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0

   .. tab-item:: PyTorch 1.10.2

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.10.2

   .. tab-item:: PyTorch 1.9.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.9.1

   .. tab-item:: PyTorch 1.8.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.8.1

   .. tab-item:: PyTorch 1.7.1

      .. tab-set::

         .. tab-item:: Ubuntu AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux AMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Ubuntu DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

         .. tab-item:: Amazon Linux DLAMI

            .. include:: /setup/install-templates/inf1/note-setup-general.rst

            .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=1.19.0 --framework-version=pytorch-1.7.1

================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-2.3.0-pytorch-install.rst
================================================

.. _install-neuron-2.3.0-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 2.3.0)
=====================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst
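The three sections of this page differ only in the ``--mode`` flag (``develop``, ``compile``, or ``deploy``) passed to the helper. A sketch that renders all three flows for Neuron 2.3.0 in one pass (assumed to run from the repository root; the exact output depends on the manifest contents):

.. code-block:: bash

   # Sketch: generate develop, compile, and deploy instructions back to back.
   for mode in develop compile deploy; do
       python3 src/helperscripts/neuronsetuphelper.py \
           --file src/helperscripts/neuron-releases-manifest.json \
           --install pytorch --mode=$mode \
           --ami=non-dlami --os=ubuntu --neuron-version=2.3.0
   done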
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.3.0 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.3.0 .. tab-item:: PyTorch 1.10.2 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2 .. tab-item:: PyTorch 1.9.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1 .. tab-item:: PyTorch 1.8.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1 .. tab-item:: PyTorch 1.7.1 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1 Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.11.0 .. tab-set:: .. tab-item:: Ubuntu AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.3.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.3.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.3.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0

            .. tab-item:: Ubuntu DLAMI
                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.3.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.3.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.3.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.3.0 --framework-version=pytorch-1.7.1


================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-2.4.0-pytorch-install.rst
================================================

.. _install-neuron-2.4.0-pytorch:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 2.4.0)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::
    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.4.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI
                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.4.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst
.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.4.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.4.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.4.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.4.0 --framework-version=pytorch-1.7.1


================================================
FILE: archive/torch-neuron/setup/prev-releases/neuron-2.5.0-pytorch-install.rst
================================================

.. _install-neuron-2.5.0-pytorch:
.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install PyTorch Neuron (Neuron 2.5.0)
======================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. contents:: Table of contents
   :local:
   :depth: 2

Develop on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/develop_mode.rst

.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

    .. tab-item:: PyTorch 1.12.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=develop --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

Compile on compute instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/compile_mode.rst

.. tab-set::

    .. tab-item:: PyTorch 1.12.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=compile --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

Deploy on AWS ML accelerator instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: /setup/install-templates/inf1/deploy_mode.rst
.. include:: /setup/install-templates/inf1/note-setup-libnrt-warning.rst

.. tab-set::

    .. tab-item:: PyTorch 1.12.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0

    .. tab-item:: PyTorch 1.11.0

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.11.0

    .. tab-item:: PyTorch 1.10.2

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.10.2

    .. tab-item:: PyTorch 1.9.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.9.1

    .. tab-item:: PyTorch 1.8.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Ubuntu DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

            .. tab-item:: Amazon Linux DLAMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst

                .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.8.1

    .. tab-item:: PyTorch 1.7.1

        .. tab-set::

            .. tab-item:: Ubuntu AMI

                .. include:: /setup/install-templates/inf1/note-setup-general.rst
program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux AMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=non-dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1 .. tab-item:: Ubuntu DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=ubuntu --neuron-version=2.5.0 --framework-version=pytorch-1.7.1 .. tab-item:: Amazon Linux DLAMI .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/neuronsetuphelper.py --file src/helperscripts/neuron-releases-manifest.json --install pytorch --mode=deploy --ami=dlami --os=amazonlinux --neuron-version=2.5.0 --framework-version=pytorch-1.7.1

================================================
FILE: archive/torch-neuron/setup/pytorch-install-cxx11.rst
================================================

.. _pytorch-install-cxx11:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Install with support for cxx11 ABI
==================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

.. warning::
   The intended user of this guide is using a custom built version of ``torch`` or compiling a non-python application which must be built using the cxx11 ABI. *Most applications do not require this specialized distribution.* For regular installation instructions see: :ref:`Fresh install <install-neuron-pytorch>`

The standard ``torch-neuron`` packages (which are normally installed according to the :ref:`Fresh install <install-neuron-pytorch>` guide) are compiled with the pre-cxx11 ABI and linked against the pre-cxx11 ``libtorch``. These compilation options ensure that the ``torch-neuron`` ABI matches the *publicly* released version of the ``torch`` package that is installed from the default PyPI index.

To support applications with specific ABI requirements, Neuron distributes packages which are linked against the cxx11 version of ``libtorch``. These ``torch-neuron`` packages are built using the ``-D_GLIBCXX_USE_CXX11_ABI=1`` compilation flag.

The only difference between these packages and the standard packages is the torch plugin library contained within the package. This is the ``libtorchneuron.so`` library located in the ``torch_neuron/lib/`` package directory. All other libraries and python files within the packages are identical. This means that these cxx11-compatible packages are drop-in replacements in environments that are incompatible with the standard releases of ``torch-neuron``. Behavior is identical whether compiling models or executing inferences.
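Because the plugin library is the only component that differs, it can be useful to locate ``libtorchneuron.so`` inside an installed environment (for example, when pointing a non-python build at it). The following one-liner is an illustrative sketch, not part of the original guide; it assumes ``torch-neuron`` is already installed via ``pip``:

.. code:: bash

   # Print the location of the Neuron plugin library within site-packages
   python3 -c "import os, torch_neuron; print(os.path.join(os.path.dirname(torch_neuron.__file__), 'lib', 'libtorchneuron.so'))"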
Installation
^^^^^^^^^^^^

All versions of the library are available to download from the following pip index:

::

   https://pip.repos.neuron.amazonaws.com/cxx11

To install a wheel, it is recommended to use the ``--no-deps`` flag since versions of ``torch`` compiled using the cxx11 ABI are not distributed on this index.

::

   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps

Specific versions of ``torch-neuron`` with cxx11 ABI support can be installed just like standard versions of ``torch-neuron``.

::

   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron>=1.8" --no-deps
   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron==1.9.1" --no-deps
   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron<1.10" --no-deps

.. important::
   This pip index does not include a distribution of ``torch`` compiled with the new cxx11 ABI. The intent of this index is *only* to provide Neuron SDK wheels. The version of ``torch`` that is distributed on the default PyPI index is compiled with the old pre-cxx11 ABI. If a cxx11 ``torch-neuron`` package is installed *with* dependencies using the *default* PyPI index, then the installed version of ``torch`` will be using the pre-cxx11 ABI and ``torch-neuron`` will be using the cxx11 ABI. This ABI mismatch will lead to errors in both python usage and at link time for non-python applications.

FAQ
^^^

When should I use a cxx11 torch-neuron wheel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Distributions compiled with the new cxx11 ABI should only be used in the following cases:

1. You have built your own version of ``torch`` which uses the new cxx11 ABI and need a corresponding version of ``torch-neuron`` that is compatible.
2. You are compiling an application against a ``libtorch`` which uses the cxx11 ABI and would like to include ``libtorchneuron.so`` as well. Torch distributes these cxx11 ``libtorch`` libraries with a ``libtorch-cxx11`` prefix. Example:

   ::

      https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.10.2%2Bcpu.zip

Can I download a library/header zip file similar to the torch distribution?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently ``torch-neuron`` does not distribute a bundled library ``.zip`` with only library/header files. The recommended alternative when compiling ``libtorchneuron.so`` into a non-python application is to install the ``torch-neuron`` wheel using ``pip`` according to the installation instructions. Then use the ``libtorchneuron.so`` library from within the python ``site-packages`` directory.

A second alternative to isolate the package contents from a python environment is to download the wheel and unpack the contents:

.. code:: bash

   pip download --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
   wheel unpack torch_neuron-*.whl

If the exact version of the ``torch-neuron`` package is known and no python/pip is available in the build environment, an alternative is to fetch the package file directly and ``unzip`` the wheel:

.. code::

   wget https://pip.repos.neuron.amazonaws.com/cxx11/torch-neuron/torch_neuron-<version>-py3-none-any.whl
   unzip torch_neuron-<version>-py3-none-any.whl

.. _pytorch-cxx11-versioning:

How can I know which ABI torch-neuron is using?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packages which use the pre-cxx11 ABI have no local identifier and use the following version scheme:

::

   <torch version>.<neuron version>
Packages which use the cxx11 ABI have a ``+cxx11`` local identifier and use the following version scheme:

::

   <torch version>.<neuron version>+cxx11

This allows the ABI to be validated by inspecting the local identifier (or version suffix). Example:

::

   1.8.1.0.0.0.0+cxx11
   1.9.1.0.0.0.0+cxx11
   1.10.2.0.0.0.0+cxx11

How can I know which ABI torch is using?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``torch`` python package provides an API that allows you to check if the underlying ``libtorch`` was compiled with the cxx11 ABI:

.. code:: python

   import torch
   torch.compiled_with_cxx11_abi()  # True/False

Currently ``torch-neuron`` does not have an equivalent API. If the cxx11 ABI was used, it will be visible in the version string (See :ref:`pytorch-cxx11-versioning`).

Troubleshooting
^^^^^^^^^^^^^^^

What python errors could I see if I mix ABI versions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using a version of ``torch`` compiled with the cxx11 ABI will trigger an error in the python interpreter when importing a version of ``torch-neuron`` using the old (pre-cxx11) ABI from the standard index. This will manifest as an error when the ``import torch_neuron`` statement is executed.

::

   Traceback (most recent call last):
     File "/python3.7/site-packages/torch_neuron/__init__.py", line 64, in <module>
       _register_extension()
     File "/python3.7/site-packages/torch_neuron/__init__.py", line 60, in _register_extension
       torch.ops.load_library(neuron_op_filename)
     File "/python3.7/site-packages/torch/_ops.py", line 110, in load_library
       ctypes.CDLL(path)
     File "/python3.7/ctypes/__init__.py", line 364, in __init__
       self._handle = _dlopen(self._name, mode)
   OSError: /python3.7/site-packages/torch_neuron/lib/libtorchneuron.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_

Similarly, using the standard pre-cxx11 version of ``torch`` with the cxx11 version of ``torch-neuron`` will also cause an error upon import.

::

   Traceback (most recent call last):
     File "/python3.7/site-packages/torch_neuron/__init__.py", line 79, in <module>
       _register_extension()
     File "/python3.7/site-packages/torch_neuron/__init__.py", line 75, in _register_extension
       torch.ops.load_library(neuron_op_filename)
     File "/python3.7/site-packages/torch/_ops.py", line 110, in load_library
       ctypes.CDLL(path)
     File "/python3.7/ctypes/__init__.py", line 364, in __init__
       self._handle = _dlopen(self._name, mode)
   OSError: /python3.7/site-packages/torch_neuron/lib/libtorchneuron.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

In either of these cases, the remedy is to ensure that the ABI of the ``torch`` distribution matches the ABI of the ``torch-neuron`` distribution.
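As a quick sanity check before importing, the reported torch ABI can be compared against the ``+cxx11`` version suffix described in the versioning FAQ above. This is an illustrative sketch, not part of the original guide:

.. code:: bash

   # Compare torch's compiled ABI against the torch-neuron version suffix
   TORCH_ABI=$(python3 -c "import torch; print(torch.compiled_with_cxx11_abi())")
   NEURON_VER=$(pip show torch-neuron | awk '/^Version:/{print $2}')
   case "${NEURON_VER}" in
     *+cxx11) NEURON_ABI=True ;;
     *)       NEURON_ABI=False ;;
   esac
   echo "torch cxx11: ${TORCH_ABI} | torch-neuron version: ${NEURON_VER}"
   [ "${TORCH_ABI}" = "${NEURON_ABI}" ] || echo "WARNING: torch/torch-neuron ABI mismatch"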
What compiler/linking errors could I see if I mix ABI versions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you link an application which uses the old (pre-cxx11) ABI ``libtorchneuron.so`` with a cxx11 version of ``torch``, this will trigger a link error.

::

   libtorchneuron.so: undefined reference to `torch::detail::class_base::class_base(std::string const&, std::string const&, std::string, std::type_info const&, std::type_info const&)'
   libtorchneuron.so: undefined reference to `c10::Error::Error(c10::SourceLocation, std::string)'
   libtorchneuron.so: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&)'
   libtorchneuron.so: undefined reference to `c10::ClassType::getMethod(std::string const&) const'
   libtorchneuron.so: undefined reference to `c10::ivalue::ConstantString::create(std::string)'
   libtorchneuron.so: undefined reference to `c10::DeviceTypeName(c10::DeviceType, bool)'
   libtorchneuron.so: undefined reference to `torch::jit::parseSchema(std::string const&)'
   libtorchneuron.so: undefined reference to `unsigned short caffe2::TypeMeta::_typeMetaData<std::string>()'
   libtorchneuron.so: undefined reference to `c10::Warning::warn(c10::SourceLocation const&, std::string const&, bool)'
   libtorchneuron.so: undefined reference to `torch::jit::parseSchemaOrName(std::string const&)'
   libtorchneuron.so: undefined reference to `c10::Symbol::fromQualString(std::string const&)'
   libtorchneuron.so: undefined reference to `c10::Error::Error(std::string, std::string, void const*)'
   libtorchneuron.so: undefined reference to `c10::detail::infer_schema::make_function_schema(std::string&&, std::string&&, c10::ArrayRef, c10::ArrayRef)'
   libtorchneuron.so: undefined reference to `c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&)'
   libtorchneuron.so: undefined reference to `torch::jit::canonicalSchemaString(c10::FunctionSchema const&)'

Similarly, an error will also occur in the opposite scenario where the cxx11 ``libtorchneuron.so`` library is used with the pre-cxx11 ``libtorch``:

::

   libtorchneuron.so: undefined reference to `c10::ivalue::ConstantString::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
   libtorchneuron.so: undefined reference to `torch::jit::parseSchemaOrName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
   libtorchneuron.so: undefined reference to `c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
   libtorchneuron.so: undefined reference to `c10::Error::Error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void const*)'
   libtorchneuron.so: undefined reference to `torch::jit::canonicalSchemaString[abi:cxx11](c10::FunctionSchema const&)'
   libtorchneuron.so: undefined reference to `torch::detail::class_base::class_base(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::type_info const&, std::type_info const&)'
   libtorchneuron.so: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
   libtorchneuron.so: undefined reference to `c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
   libtorchneuron.so: undefined reference to `c10::detail::infer_schema::make_function_schema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, c10::ArrayRef, c10::ArrayRef)'
   libtorchneuron.so: undefined reference to `torch::jit::parseSchema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
   libtorchneuron.so: undefined reference to `c10::DeviceTypeName[abi:cxx11](c10::DeviceType, bool)'
   libtorchneuron.so: undefined reference to `c10::Symbol::fromQualString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
   libtorchneuron.so: undefined reference to `unsigned short caffe2::TypeMeta::_typeMetaData<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >()'
   libtorchneuron.so: undefined reference to `c10::ClassType::getMethod(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
   libtorchneuron.so: undefined reference to `c10::Warning::warn(c10::SourceLocation const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)'

In either of these cases, the remedy is to ensure that the ABI of the ``libtorch`` distribution matches the ABI of the ``libtorchneuron.so`` distribution. The ``torch`` ABI must match the ``torch-neuron`` ABI or an error will occur.

================================================ FILE: archive/torch-neuron/setup/pytorch-install-prev-al2.rst ================================================ .. _pytorch-neuron-install-prev-al2: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install Previous PyTorch Neuron Releases for Amazon Linux (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 This section will assist you in installing previous Neuron releases. .. tab-set:: .. tab-item:: Neuron 2.18.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.18.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.17.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.17.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.16.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.16.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-install-prev-al2023.rst ================================================ .. _pytorch-neuron-install-prev-al2023: .. Install previous PyTorch Neuron releases for Amazon Linux 2023 - archived Use the tabs below to install a specific previous Neuron SDK release. Select the Neuron version you need. .. tab-set:: .. tab-item:: Neuron 2.21.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 ..
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-install-prev-u20.rst ================================================ .. _pytorch-neuron-install-prev-u20: .. Install previous PyTorch Neuron releases for Ubuntu 20.04 - archived Use the tabs below to install a specific previous Neuron SDK release. Select the Neuron version you need. .. tab-set:: .. tab-item:: Neuron 2.21.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-install-prev-u22.rst ================================================ .. _pytorch-neuron-install-prev-u22: .. Install previous PyTorch Neuron releases for Ubuntu 22.04 - archived Use the tabs below to install a specific previous Neuron SDK release. Select the Neuron version you need. .. tab-set:: .. tab-item:: Neuron 2.21.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.21.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.20.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.20.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami .. tab-item:: Neuron 2.19.0 .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --neuron-version=2.19.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-install-prev.rst ================================================ .. _install-prev-neuron-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install previous PyTorch Neuron releases (``torch-neuron``) ============================================================ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. 
include:: /setup/install-templates/inf1/note-setup-cntr.rst .. toctree:: :maxdepth: 1 Neuron 2.5.0 Neuron 2.4.0 Neuron 2.3.0 Neuron 1.19.0 Neuron 1.18.0 Neuron 1.17.2 Neuron 1.16.3 Neuron 1.16.2 Neuron 1.16.1 Neuron 1.15.2 Neuron 1.15.1 Neuron 1.15.0 Neuron 1.14.2 ================================================ FILE: archive/torch-neuron/setup/pytorch-install.rst ================================================ .. _install-neuron-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Install PyTorch Neuron (``torch-neuron``) ========================================= .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /setup/install-templates/inf1/note-setup-cntr.rst .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.12.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.11.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.10.2 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.9.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.12.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.11.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.10.2 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.9.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.12.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. 
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.12.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.11.0 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.10.2 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.10.2 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami .. tab-item:: PyTorch 1.9.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.9.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-update-al2-dlami.rst ================================================ .. _pytorch-neuron-al2-update: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Update to latest PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links that will assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. include:: /setup/install-templates/inf1/note-setup-general.rst ..
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=dlami-framework ================================================ FILE: archive/torch-neuron/setup/pytorch-update-al2023.rst ================================================ .. _pytorch-neuron-al2023-update: .. Update PyTorch Neuron (torch-neuron) on Amazon Linux 2023 - archived If you already have a previous Neuron release installed, select the PyTorch version tab below to get the update commands. .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. include:: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-update-u20-dlami.rst ================================================ .. _pytorch-neuron-u20-update: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Update to latest PyTorch Neuron (``torch-neuron``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. If you already have a previous Neuron release installed, this section provides links that will assist you in updating to the latest Neuron release. .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. include:: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=dlami-framework ================================================ FILE: archive/torch-neuron/setup/pytorch-update-u20.rst ================================================ .. _pytorch-neuron-u20-update: .. Update PyTorch Neuron (torch-neuron) on Ubuntu 20.04 - archived If you already have a previous Neuron release installed, select the PyTorch version tab below to get the update commands. .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. include:: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-update-u22.rst ================================================ .. _pytorch-neuron-u22-update: .. Update PyTorch Neuron (torch-neuron) on Ubuntu 22.04 - archived If you already have a previous Neuron release installed, select the PyTorch version tab below to get the update commands. .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. include:: /setup/install-templates/inf1/note-setup-general.rst ..
program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami ================================================ FILE: archive/torch-neuron/setup/pytorch-update.rst ================================================ .. _update-neuron-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Update to latest PyTorch Neuron (``torch-neuron``) ================================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. include:: /setup/install-templates/inf1/note-setup-cntr.rst .. contents:: Table of contents :local: :depth: 2 Develop on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/develop_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Compile on compute instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/compile_mode.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=compile --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami Deploy on AWS ML accelerator instance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/deploy_mode.rst .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. tab-set:: .. tab-item:: PyTorch 1.13.1 .. tab-set:: .. tab-item:: Ubuntu 20 DLAMI Base .. include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=non-dlami .. tab-item:: Amazon Linux 2 DLAMI Base .. 
include :: /setup/install-templates/inf1/note-setup-general.rst .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --mode=deploy --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=non-dlami

================================================
FILE: archive/torch-neuron/torch-neuron-dataparallel-example-default.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The default DataParallel use mode will replicate the model on all available NeuronCores in the current process. The inputs will be split on ``dim=0``.

.. code-block:: python

   import torch
   import torch_neuron
   from torchvision import models

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Compile with an example input
   image = torch.rand([1, 3, 224, 224])
   model_neuron = torch.neuron.trace(model, image)

   # Create the DataParallel module
   model_parallel = torch.neuron.DataParallel(model_neuron)

   # Create a batched input
   batch_size = 5
   image_batched = torch.rand([batch_size, 3, 224, 224])

   # Run inference with a batched input
   output = model_parallel(image_batched)

================================================
FILE: archive/torch-neuron/torch-neuron-dataparallel-example-dim-neq-zero.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

In this example we run DataParallel inference using four NeuronCores and ``dim = 2``. Because ``dim != 0``, dynamic batching is not enabled. Consequently, the DataParallel inference-time batch size must be four times the compile-time batch size. DataParallel will generate a warning that dynamic batching is disabled because ``dim != 0``.

.. code-block:: python

   import torch
   import torch_neuron

   # Create an example model
   class Model(torch.nn.Module):
       def __init__(self):
           super().__init__()
           self.conv = torch.nn.Conv2d(3, 3, 3)

       def forward(self, x):
           return self.conv(x) + 1

   model = Model()
   model.eval()

   # Compile with an example input
   image = torch.rand([1, 3, 8, 8])
   model_neuron = torch.neuron.trace(model, image)

   # Create the DataParallel module using 4 NeuronCores and dim = 2
   model_parallel = torch.neuron.DataParallel(model_neuron, device_ids=[0, 1, 2, 3], dim=2)

   # Create a batched input
   # Note that image_batched.shape[dim] / len(device_ids) == image.shape[dim]
   batch_size = 4 * 8
   image_batched = torch.rand([1, 3, batch_size, 8])

   # Run inference with a batched input
   output = model_parallel(image_batched)

================================================
FILE: archive/torch-neuron/torch-neuron-dataparallel-example-disable-dynamic-batching.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

In the following example, we use :func:`torch.neuron.DataParallel.disable_dynamic_batching` to disable dynamic batching. We provide an example of a batch size that will not work when dynamic batching is disabled as well as an example of a batch size that does work when dynamic batching is disabled.

.. code-block:: python

   import torch
   import torch_neuron
   from torchvision import models

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Compile with an example input
   image = torch.rand([1, 3, 224, 224])
   model_neuron = torch.neuron.trace(model, image)

   # Create the DataParallel module and use 4 NeuronCores
   model_parallel = torch.neuron.DataParallel(model_neuron, device_ids=[0, 1, 2, 3], dim=0)

   # Disable dynamic batching
   model_parallel.disable_dynamic_batching()

   # Create a batched input (this won't work)
   batch_size = 8
   image_batched = torch.rand([batch_size, 3, 224, 224])

   # This will fail because dynamic batching is disabled and
   # image_batched.shape[dim] / len(device_ids) != image.shape[dim]
   # output = model_parallel(image_batched)

   # Create a batched input (this will work)
   batch_size = 4
   image_batched = torch.rand([batch_size, 3, 224, 224])

   # This will work because
   # image_batched.shape[dim] / len(device_ids) == image.shape[dim]
   output = model_parallel(image_batched)

================================================
FILE: archive/torch-neuron/torch-neuron-dataparallel-example-dynamic-batching.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

In the following example, we use the :func:`torch.neuron.DataParallel` module to run inference using several different batch sizes without recompiling the Neuron model.

.. code-block:: python

   import torch
   import torch_neuron
   from torchvision import models

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Compile with an example input
   image = torch.rand([1, 3, 224, 224])
   model_neuron = torch.neuron.trace(model, image)

   # Create the DataParallel module
   model_parallel = torch.neuron.DataParallel(model_neuron)

   # Create batched inputs and run inference on the same model
   batch_sizes = [2, 3, 4, 5, 6]
   for batch_size in batch_sizes:
       image_batched = torch.rand([batch_size, 3, 224, 224])

       # Run inference with a batched input
       output = model_parallel(image_batched)

================================================
FILE: archive/torch-neuron/torch-neuron-dataparallel-example-specify-ncs.rst
================================================

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

The following example uses the ``device_ids`` argument to use the first three NeuronCores for DataParallel inference.
.. code-block:: python

   import torch
   import torch_neuron
   from torchvision import models

   # Load the model and set it to evaluation mode
   model = models.resnet50(pretrained=True)
   model.eval()

   # Compile with an example input
   image = torch.rand([1, 3, 224, 224])
   model_neuron = torch.neuron.trace(model, image)

   # Create the DataParallel module, run on the first three NeuronCores
   # Equivalent to model_parallel = torch.neuron.DataParallel(model_neuron, device_ids=[0, 1, 2])
   model_parallel = torch.neuron.DataParallel(model_neuron, device_ids=['nc:0', 'nc:1', 'nc:2'])

   # Create a batched input
   batch_size = 5
   image_batched = torch.rand([batch_size, 3, 224, 224])

   # Run inference with a batched input
   output = model_parallel(image_batched)

================================================
FILE: archive/torch-neuron/troubleshooting-guide.rst
================================================

.. _pytorch-neuron-inference-troubleshooting:

.. meta::
   :noindex:
   :nofollow:
   :description: This content is archived and no longer maintained.
   :date-modified: 2026-03-11

Troubleshooting Guide for PyTorch Neuron (``torch-neuron``)
===========================================================

.. warning::
   This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`.

Patching PyTorch version 1.13 for CVEs
--------------------------------------

PyTorch version 1.13 has the following CVEs:

- CVE-2025-32434
- CVE-2024-31580
- CVE-2024-31583

To patch PyTorch version 1.13, run the following on a CPU instance with Ubuntu 22 AMI (it takes 30 minutes on a c5.4xlarge):

::

   git clone --recursive https://github.com/pytorch/pytorch -b v1.13.1
   cd pytorch
   git cherry-pick b5c3a17c2c207ebefcb85043f0cf94be9b2fef81
   git cherry-pick 9c7071b0e324f9fb68ab881283d6b8d388a4bcd2
   wget https://github.com/user-attachments/files/22013116/patch_v113.txt
   git apply patch_v113.txt

To build the pip wheel, see `build steps `_. A condensed version is provided below. Install Miniconda by following `installation steps `_ and run the following commands:

::

   source ~/miniconda3/bin/activate
   conda create --name conda_py39 python=3.9
   conda activate conda_py39
   conda install astunparse numpy==1.19.5 ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses
   conda install mkl mkl-include
   # CUDA only: Add LAPACK support for the GPU if needed
   conda install -c pytorch magma-cuda110 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
   sudo apt install cmake g++
   export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
   PYTORCH_BUILD_VERSION=1.13.2 PYTORCH_BUILD_NUMBER=1 python setup.py bdist_wheel
   # the PyTorch pip wheel will be in dist directory

General Torch-Neuron issues
---------------------------

If you see an error about "Unknown builtin op: neuron::forward_1" like below, please ensure that the import line ``import torch_neuron`` (which registers the Neuron custom operation) is in the inference script before ``torch.jit.load`` is used.

::

   Unknown builtin op: neuron::forward_1.
   Could not find any similar ops to neuron::forward_1. This op may not exist or may not be currently supported in TorchScript.

TorchVision related issues
--------------------------

If you encounter an error like below, it is because torchvision versions >= 0.7 are not compatible with Torch-Neuron 1.5.1. Please downgrade torchvision to version 0.6.1:

::

   E AttributeError: module 'torch.jit' has no attribute '_script_if_tracing'

2GB protobuf limit related issues
---------------------------------

If you encounter an error like below, it is because the model size is larger than 2GB. To compile such large models, use the :ref:`separate_weights=True ` flag. Note: ensure that you have the latest version of the compiler installed to support this flag. You can upgrade neuron-cc using :code:`python3 -m pip install neuron-cc[tensorflow] -U --force --extra-index-url=https://pip.repos.neuron.amazonaws.com`

::

   E google.protobuf.message.DecodeError: Error parsing message with type 'tensorflow.GraphDef'

torch.jit.trace issues
----------------------

The :doc:`Trace API ` uses the PyTorch :func:`torch.jit.trace` function to generate :class:`~torch.jit.ScriptModule` models for execution on Inferentia. Consequently, to execute your PyTorch model on Inferentia it must be torch-jit-traceable. You can try modifying your underlying PyTorch model code to make it traceable. If it's not possible to change your model code, you can :ref:`write a wrapper around your model <wrapping-non-traceable-models>` that makes it torch-jit-traceable to compile it for Inferentia. Please visit :func:`torch.jit.trace` to review the properties that a model must have to be torch-jit-traceable.

The PyTorch-Neuron trace API :func:`torch_neuron.trace` accepts :code:`**kwargs` for :func:`torch.jit.trace`. For example, you can use the :code:`strict=False` flag to :ref:`compile models with dictionary outputs `.

.. _wrapping-non-traceable-models:

Compiling models with outputs that are not torch-jit-traceable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To enable compilation of models with non torch-jit-traceable outputs, you can use a technique that involves writing a wrapper that converts the model's output into a form that is torch-jit-traceable. You can then compile the wrapped model for Inferentia using :func:`torch_neuron.trace`.

The following example uses a wrapper to compile a model with non torch-jit-traceable outputs. This model cannot be compiled for Inferentia in its current form because it outputs a list of tuples and tensors, which is not torch-jit-traceable.

.. code-block:: python

   import torch
   import torch_neuron
   import torch.nn as nn

   class Model(nn.Module):
       def __init__(self):
           super(Model, self).__init__()
           self.conv = nn.Conv2d(1, 1, 3)

       def forward(self, x):
           a = self.conv(x) + 1
           b = self.conv(x) + 2
           c = self.conv(x) + 3

           # An output that is a list of tuples and tensors is not torch-traceable
           return [(a, b), c]

   model = Model()
   model.eval()

   inputs = torch.rand(1, 1, 3, 3)

   # Try to compile the model
   model_neuron = torch.neuron.trace(model, inputs)  # ERROR: This cannot be traced, we must change the output format

To compile this model for Inferentia, we can write a wrapper around the model to convert its outputs into a tuple of tensors, which is torch-jit-traceable.

.. code-block:: python

   class NeuronCompatibilityWrapper(nn.Module):
       def __init__(self):
           super(NeuronCompatibilityWrapper, self).__init__()
           self.model = Model()

       def forward(self, x):
           out = self.model(x)
           # An output that is a tuple of tuples and tensors is torch-jit-traceable
           return tuple(out)

Now, we can successfully compile the model for Inferentia using the :code:`NeuronCompatibilityWrapper` wrapper as follows:

.. code-block:: python

   model = NeuronCompatibilityWrapper()
   model.eval()

   # Compile the traceable wrapped model
   model_neuron = torch.neuron.trace(model, inputs)

If the model's outputs must be in the original form, a second wrapper can be used to transform the outputs after compilation for Inferentia. The following example uses the :code:`OutputFormatWrapper` wrapper to convert the compiled model's output back into the original form of a list of tuples and tensors.

.. code-block:: python

   class OutputFormatWrapper(nn.Module):
       def __init__(self):
           super(OutputFormatWrapper, self).__init__()
           self.traceable_model = NeuronCompatibilityWrapper()

       def forward(self, x):
           out = self.traceable_model(x)
           # Return the output in the original format of Model()
           return list(out)

   model = OutputFormatWrapper()
   model.eval()

   # Compile the traceable wrapped model
   model.traceable_model = torch.neuron.trace(model.traceable_model, inputs)

Compiling a submodule in a model that is not torch-jit-traceable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following example shows how to compile a submodule that is part of a non torch-jit-traceable model. In this example, the top-level model :code:`Outer` uses a dynamic flag, which is not torch-jit-traceable. However, the submodule :code:`Inner` is torch-jit-traceable and can be compiled for Inferentia.

.. code-block:: python

   import torch
   import torch_neuron
   import torch.nn as nn

   class Inner(nn.Module):
       def __init__(self):
           super().__init__()
           self.conv = nn.Conv2d(1, 1, 3)

       def forward(self, x):
           return self.conv(x) + 1

   class Outer(nn.Module):
       def __init__(self):
           super().__init__()
           self.inner = Inner()

       def forward(self, x, add_offset: bool = False):
           base = self.inner(x)
           if add_offset:
               return base + 1
           return base

   model = Outer()
   inputs = torch.rand(1, 1, 3, 3)

   # Compile the traceable wrapped submodule
   model.inner = torch.neuron.trace(model.inner, inputs)

   # TorchScript the model for serialization
   script = torch.jit.script(model)
   torch.jit.save(script, 'model.pt')
   loaded = torch.jit.load('model.pt')

Alternatively, for usage scenarios in which the model configuration is static during inference, the dynamic flags can be hardcoded in a wrapper to make the model torch-jit-traceable and enable compiling the entire model for Inferentia. In this example, we assume the :code:`add_offset` flag is always :code:`True` during inference, so we can hardcode this conditional path in the :code:`Static` wrapper to remove the dynamic behavior and compile the entire model for Inferentia.

.. code-block:: python

   class Static(nn.Module):
       def __init__(self):
           super().__init__()
           self.outer = Outer()

       def forward(self, x):
           # hardcode `add_offset=True`
           output = self.outer(x, add_offset=True)
           return output

   model = Static()

   # We can now compile the entire model because `add_offset=True` is hardcoded in the Static wrapper
   model_neuron = torch.neuron.trace(model, inputs)

================================================ FILE: archive/torch-neuron/tutorials/neuroncore_pipeline_pytorch.rst ================================================ .. _pytorch-tutorials-neuroncore-pipeline-pytorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Using NeuronCore Pipeline with PyTorch Tutorial ================================================================ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only.
For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of Contents :local: :depth: 2 Overview -------- In this tutorial we will benchmark the latency of a Hugging Face Transformers model deployed in model pipeline parallel mode using the NeuronCore Pipeline feature. We will compare the results with the usual data parallel (multi-worker) deployment. We compile a pretrained BERT base model and run the benchmarking locally. To enable faster environment setup, we will run both compilation and deployment (inference) on a single inf1.6xlarge instance. You can take similar steps to recreate the benchmark on other instance sizes, such as inf1.xlarge. If you already have an Inf1 instance environment ready, this tutorial is available as a Jupyter notebook at :pytorch-neuron-src:`neuroncore_pipeline_pytorch.ipynb ` and instructions can be viewed at: .. toctree:: :maxdepth: 1 /src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.ipynb Instructions for how to set up the environment and run the tutorial are available in the next sections. .. _pytorch-neuroncore-pipeline-pytorch-env-setup: Setup The Environment --------------------- Launch an Inf1 instance by following the steps below; make sure to choose an inf1.6xlarge instance. .. include:: /setup/install-templates/inf1/launch-inf1-dlami.rst .. _pytorch-neuroncore-pipeline-pytorch-run-tutorial: Run The Tutorial ---------------- After connecting to the instance from the terminal, clone the Neuron GitHub repository to the EC2 instance and then change the working directory to the tutorial directory: .. code:: git clone https://github.com/aws/aws-neuron-sdk.git cd aws-neuron-sdk/src/examples/pytorch The Jupyter notebook is available as a file with the name :pytorch-neuron-src:`neuroncore_pipeline_pytorch.ipynb `. You can either run the Jupyter notebook from a browser or run it as a script from the terminal: * **Running tutorial from browser** * First set up and launch the Jupyter notebook on your local browser by following instructions at :ref:`Running Jupyter Notebook Browser` * Open the Jupyter notebook from the menu and follow the instructions You can also view the Jupyter notebook at: .. toctree:: :maxdepth: 1 /src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.ipynb .. _pytorch-neuroncore-pipeline-pytorch-cleanup-instances: Clean up your instance/s ------------------------ After you've finished with the instance/s that you created for this tutorial, you should clean up by terminating the instance/s; follow the instructions at `Clean up your instance `_. ================================================ FILE: archive/torch-neuron/tutorials/pytorch-tutorial-setup.rst ================================================ .. _pytorch-tutorial-setup: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch Tutorial Setup ====================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. #. Launch an Inf1.6xlarge Instance: .. include:: /setup/install-templates/inf1/launch-inf1-dlami.rst #. Set up a development environment: * Enable or install PyTorch-Neuron: :ref:`install-neuron-pytorch`. #. Run tutorial in Jupyter notebook: * Follow the instructions at :ref:`Setup Jupyter notebook ` to: #. Start the Jupyter Notebook on the instance #.
Run the Jupyter Notebook from your local browser * Connect to the instance from the terminal, clone the Neuron GitHub repository to the Inf1 instance, and then change the working directory to the tutorial directory: .. code:: git clone https://github.com/aws/aws-neuron-sdk.git cd aws-neuron-sdk/src/examples/pytorch * Locate the tutorial notebook file (.ipynb file) under ``aws-neuron-sdk/src/examples/pytorch`` * From your local browser, open the tutorial notebook from the menu and follow the instructions. ================================================ FILE: archive/torch-neuron/tutorials/transformers-marianmt.rst ================================================ .. _pytorch-tutorials-marianmt: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 PyTorch HuggingFace MarianMT Tutorial ===================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of Contents :local: :depth: 2 Overview -------- In this tutorial you will compile and deploy the `HuggingFace MarianMT `_ model for sequence-to-sequence language translation on an Inf1 instance. To enable faster environment setup, you will run the tutorial on an inf1.6xlarge instance to enable both compilation and deployment (inference) on the same instance. In a production environment we encourage you to try different instance sizes to optimize for your specific deployment needs. If you have already launched an Inf1 instance and have the Neuron PyTorch DLAMI environment ready, the tutorial is available as a Jupyter notebook at :pytorch-neuron-src:`transformers-marianmt.ipynb ` and instructions can be viewed at: .. toctree:: :maxdepth: 1 /src/examples/pytorch/transformers-marianmt.ipynb Instructions for how to set up the Neuron PyTorch environment and run the tutorial as a Jupyter notebook are available in the next sections. .. _pytorch-marianmt-env-setup: Setup The Environment --------------------- Launch an Inf1 instance by following the steps below; make sure to choose an inf1.6xlarge instance. .. include:: /setup/install-templates/inf1/launch-inf1-dlami.rst .. _pytorch-marianmt-run-tutorial: Run The Tutorial ---------------- After connecting to the instance from the terminal, clone the Neuron GitHub repository to the EC2 instance and then change the working directory to the tutorial directory: .. code:: git clone https://github.com/aws/aws-neuron-sdk.git cd aws-neuron-sdk/src/examples/pytorch The Jupyter notebook is available as a file with the name :pytorch-neuron-src:`transformers-marianmt.ipynb ` that you can run from a browser: * **Running tutorial from browser** * First set up and launch the Jupyter notebook on your local browser by following instructions at :ref:`Running Jupyter Notebook Browser` * Open the Jupyter notebook from the menu and follow the instructions You can also view the Jupyter notebook at: .. toctree:: :maxdepth: 1 /src/examples/pytorch/transformers-marianmt.ipynb .. _marianmt-cleanup-instances: Clean up your instance/s ------------------------ After you've finished with the instance/s that you created for this tutorial, you should clean up by terminating the instance/s; follow the instructions at `Clean up your instance `_.
================================================ FILE: archive/torch-neuron/tutorials/tutorial-libtorch.rst ================================================ .. _pytorch-tutorials-libtorch: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 LibTorch C++ Tutorial ========================= .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of Contents :local: :depth: 2 Overview -------- This tutorial demonstrates the use of `LibTorch `_ with Neuron, the SDK for Amazon Inf1, Inf2 and Trn1 instances. By the end of this tutorial, you will understand how to write a native C++ application that performs inference on EC2 Inf1, Inf2 and Trn1 instances. We will use an inf1.6xlarge and a pretrained BERT-Base model to determine if one sentence is a paraphrase of another. Verify that this tutorial is running in a virtual environment that was set up according to the `Torch-Neuronx Installation Guide ` or `Torch-Neuron Installation Guide `. Notes ----- The tutorial has been tested on Inf1, Inf2, and Trn1 instances running Ubuntu. Run the tutorial ---------------- This tutorial is self-contained. It produces similar output to :ref:`[html] ` :pytorch-neuron-src:`[notebook] `. Note: The tutorial will use about 8.5 GB of disk space. Ensure you have sufficient space before beginning. Right-click and copy :download:`this link address to the tutorial archive`. .. code:: bash wget tar xvf libtorch_demo.tar.gz Your directory tree should now look like this: :: libtorch_demo ├── bert_neuronx │ ├── compile.py │ └── detect_instance.py ├── clean.sh ├── core_count │ ├── build.sh │ └── main.cpp ├── example_app │ ├── build.sh │ ├── core_count.hpp │ ├── example_app.cpp │ ├── README.txt │ ├── utils.cpp │ └── utils.hpp ├── neuron.patch ├── run_tests.sh ├── setup.sh ├── tokenizer.json └── tokenizers_binding ├── build_python.sh ├── build.sh ├── remote_rust_tokenizer.h ├── run_python.sh ├── run.sh ├── tokenizer.json ├── tokenizer_test ├── tokenizer_test.cpp └── tokenizer_test.py This tutorial uses the `HuggingFace Tokenizers `_ library implemented in Rust. Install Cargo, the package manager for the Rust programming language. +----------------------------------+----------------------------------+ | Ubuntu | Amazon Linux 2023 | +----------------------------------+----------------------------------+ | .. code-block:: bash | .. code-block:: bash | | | | | sudo apt install -y cargo | sudo dnf install -y cargo | +----------------------------------+----------------------------------+ Run the setup script to download additional dependencies and build the app. (This may take a few minutes to complete.) .. literalinclude:: tutorial_source_instructions/run_libtorch.sh :language: bash :lines: 6-7 :: ... + PATH_NEURON_LIB=/opt/aws/neuron/lib/ + g++ utils.cpp example_app.cpp -o ../example-app -O2 -D_GLIBCXX_USE_CXX11_ABI=0 -I../libtorch/include -L../tokenizers_binding/lib -L/opt/aws/neuron/lib/ -L../libtorch/lib -Wl,-rpath,libtorch/lib -Wl,-rpath,tokenizers_binding/lib -Wl,-rpath,/opt/aws/neuron/lib/ -ltokenizers -ltorchneuron -ltorch_cpu -lc10 -lpthread -lnrt ~/libtorch_demo Successfully completed setup .. _libtorch-benchmark: Benchmark --------- The setup script should have compiled and saved a PyTorch model for Neuron (``bert_neuron_b6.pt``).
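For reference, the compile step performed by the setup script is conceptually similar to the Python sketch below. The checkpoint name (``bert-base-cased-finetuned-mrpc``), sequence length, and batch size are assumptions based on this tutorial's paraphrase task, not the exact contents of ``bert_neuronx/compile.py``.

.. code-block:: python

    import torch
    import torch_neuron
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumption: an MRPC-finetuned BERT-Base paraphrase classifier
    name = 'bert-base-cased-finetuned-mrpc'
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
    model.eval()

    # Fixed-shape example inputs: a batch of 6 sentence pairs padded to 128 tokens
    pairs = [('The company HuggingFace is based in New York City',
              "HuggingFace's headquarters are situated in Manhattan")] * 6
    encoded = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                        max_length=128, padding='max_length',
                        truncation=True, return_tensors='pt')
    example = (encoded['input_ids'], encoded['attention_mask'],
               encoded['token_type_ids'])

    # Compile for Neuron and save the artifact that the C++ example app loads
    model_neuron = torch_neuron.trace(model, example)
    model_neuron.save('bert_neuron_b6.pt')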
Run the provided sanity tests to ensure everything is working properly. .. literalinclude:: tutorial_source_instructions/run_libtorch.sh :language: bash :lines: 10 :: Running tokenization sanity checks. None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. Tokenizing: 100%|██████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 15021.69it/s] Python took 0.67 seconds. Sanity check passed. Begin 10000 timed tests. .......... End timed tests. C++ took 0.226 seconds. Tokenization sanity checks passed. Running end-to-end sanity check. The company HuggingFace is based in New York City HuggingFace's headquarters are situated in Manhattan not paraphrase: 10% paraphrase: 90% The company HuggingFace is based in New York City Apples are especially bad for your health not paraphrase: 94% paraphrase: 6% Sanity check passed. Finally, run the example app directly to benchmark the BERT model. .. note:: You can safely ignore the warning about ``None of PyTorch, TensorFlow >= 2.0, ...``. This occurs because the test runs in a small virtual environment that doesn't require the full frameworks. .. literalinclude:: tutorial_source_instructions/run_libtorch.sh :language: bash :lines: 13 :: Getting ready................ Benchmarking................ Completed 32000 operations in 43 seconds => 4465.12 pairs / second ==================== Summary information: ==================== Batch size = 6 Num neuron cores = 16 Num runs per neuron core = 2000 **Congratulations!** By now you should have successfully built and used a native C++ application with LibTorch. Troubleshooting --------------- * In the event of SIGBUS errors, you may have insufficient disk space for the creation of temporary model files at runtime. Consider clearing space or mounting additional disk storage. * In the event of a Neuron runtime failure, confirm that the Neuron kernel module is loaded using ``sudo modprobe neuron``. .. _libtorch-cleanup: ================================================ FILE: archive/torch-neuron/tutorials/tutorial-torchserve.rst ================================================ .. _pytorch-tutorials-torchserve: .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 BERT TorchServe Tutorial ======================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. contents:: Table of Contents :local: :depth: 2 Overview -------- This tutorial demonstrates the use of `TorchServe `_ with Neuron, the SDK for Amazon Inf1 instances. By the end of this tutorial, you will understand how TorchServe can be used to serve a model backed by EC2 Inf1 instances. We will use a pretrained BERT-Base model to determine if one sentence is a paraphrase of another. Verify that this tutorial is running in a virtual environment that was set up according to the `Torch-Neuronx Installation Guide ` or `Torch-Neuron Installation Guide `. .. _torchserve-compile: Run the tutorial ---------------- Open a terminal, log into your remote instance, and activate a PyTorch virtual environment (see the :ref:`PyTorch Installation Guide `). To complete this tutorial, you will need a compiled BERT model.
If you have already completed the HuggingFace Pretrained BERT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` then you already have the necessary file. Otherwise, you can set up your environment as shown below and then run :download:`trace_bert_neuron.py ` to obtain a traced BERT model. You should now have a compiled ``bert_neuron_b6.pt`` file, which is required going forward. Open a shell on the instance you prepared earlier and create a new directory named ``torchserve``. Copy your compiled model from the previous tutorial into this new directory. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 4-6 :: bert_neuron_b6.pt Prepare a new Python virtual environment with the necessary Neuron and TorchServe components. Use a virtual environment to keep (most of) the various tutorial components isolated from the rest of the system in a controlled way. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 8 Install the system requirements for TorchServe. .. tab-set:: .. tab-item:: Amazon Linux 2023 DLAMI Base .. code-block:: bash sudo dnf install jq java-11-amazon-corretto-headless sudo alternatives --config java sudo alternatives --config javac .. tab-item:: Ubuntu 20 DLAMI Base .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 10 .. code:: bash java -version :: openjdk version "11.0.17" 2022-10-18 OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04) OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing) .. code:: bash javac -version :: javac 11.0.17 Verify that TorchServe is now available. .. code:: bash torchserve --version :: TorchServe Version is 0.7.0 .. _torchserve-setup: Setup TorchServe ---------------- During this tutorial, you will need to download a few files onto your instance. The simplest way to accomplish this is to paste the download links provided above each file into a ``wget`` command. (We don't provide the links directly because they are subject to change.) For example, right-click and copy the download link for ``config.json`` shown below. .. literalinclude:: /src/examples/pytorch/torchserve/config.json :language: JSON :caption: :download:`config.json ` Now execute the following in your shell: .. code:: bash wget ls :: bert_neuron_b6.pt config.json Download the `custom handler script `_ that will eventually respond to inference requests. .. literalinclude:: /src/examples/pytorch/torchserve/handler_bert.py :language: python :caption: :download:`handler_bert.py ` :linenos: Next, we need to associate the handler script with the compiled model using ``torch-model-archiver``. Run the following commands in your terminal: .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 12-16 .. note:: If you modify your model or a dependency, you will need to rerun the archiver command with the ``-f`` flag appended to update the archive. The result of the above will be a ``.mar`` file inside the ``model_store`` directory. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 18 :: bert-max_length128-batch_size6.mar This file is essentially an archive associated with a fixed version of your model along with its dependencies (e.g., the handler code). .. note:: The version specified in the ``torch-model-archiver`` command can be appended to REST API requests to access a specific version of your model.
For example, if your model was hosted locally on port 8080 and named "bert", the latest version of your model would be available at ``http://localhost:8080/predictions/bert``, while version 1.0 would be accessible at ``http://localhost:8080/predictions/bert/1.0``. We will see how to perform inference using this API in Step 6. Create a `custom config `_ file to set some parameters. This file will be used to configure the server at launch when we run ``torchserve --start``. .. literalinclude:: /src/examples/pytorch/torchserve/torchserve.config :language: properties :caption: :download:`torchserve.config ` .. note:: This will cause TorchServe to bind on all interfaces. For security in real-world applications, you’ll probably want to use port 8443 and `enable SSL `_. .. _torchserve-run: Run TorchServe -------------- It's time to start the server. Typically we'd want to launch this in a separate console, but for this demo we’ll just redirect output to a file. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 20 Verify that the server seems to have started okay. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 22 :: { "status": "Healthy" } .. note:: If you get an error when trying to ping the server, you may have tried before the server was fully launched. Check ``torchserve.log`` for details. Use the Management API to instruct TorchServe to load our model. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 24-26 :: { "status": "Model \"bert-max_length128-batch_size6\" Version: 1.0 registered with 4 initial workers" } .. note:: Any additional attempts to configure the model after the initial curl request will cause the server to return a 409 error. You’ll need to stop/start/configure the server to realize any changes. The ``MAX_BATCH_DELAY`` is a timeout value that determines how long to wait before processing a partial batch. This is why the handler code needs to check the batch dimension and potentially add padding. TorchServe will instantiate the number of model handlers indicated by ``INITIAL_WORKERS``, so this value controls how many models we will load onto Inferentia in parallel. This tutorial was performed on an inf1.xlarge instance (one Inferentia chip), so there are four NeuronCores available. If you want to control worker scaling more dynamically, `see the docs `_. .. warning:: If you attempt to load more models than NeuronCores available, one of two things will occur. Either the extra models will fit in device memory but performance will suffer, or you will encounter an error on your initial inference. You shouldn't set ``INITIAL_WORKERS`` above the number of NeuronCores. However, you may want to use fewer cores if you are using the :ref:`neuroncore-pipeline` feature. It looks like everything is running successfully at this point, so it's time for an inference. Create the ``infer_bert.py`` file below on your instance. .. literalinclude:: /src/examples/pytorch/torchserve/infer_bert.py :language: python :caption: :download:`infer_bert.py ` :linenos: This script will send a ``batch_size`` number of requests to our model. In this example, we are using a model that estimates the probability that one sentence is a paraphrase of another. The script sends positive examples in the first half of the batch and negative examples in the second half. Execute the script in your terminal. .. 
literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 28 :: 1 ['paraphrase'] 3 ['not paraphrase'] 4 ['not paraphrase'] 0 ['paraphrase'] 5 ['not paraphrase'] 2 ['paraphrase'] We can see that the first three threads (0, 1, 2) all report ``paraphrase``, as expected. If we instead modify the script to send an incomplete batch and then wait for the timeout to expire, the excess padding results will be discarded. .. _torchserve-benchmark: Benchmark TorchServe -------------------- We've seen how to perform a single batched inference, but how many inferences can we process per second? A separate upcoming tutorial will document performance tuning to maximize throughput. In the meantime, we can still perform a simple naïve stress test. The code below will spawn 64 worker threads, with each thread repeatedly sending a full batch of data to process. A separate thread will periodically print throughput and latency measurements. .. literalinclude:: /src/examples/pytorch/torchserve/benchmark_bert.py :language: python :caption: :download:`benchmark_bert.py ` :linenos: Run the benchmarking script. .. literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 30 :: pid 28523: current throughput 0.0, latency p50=0.000 p90=0.000 pid 28523: current throughput 617.7, latency p50=0.092 p90=0.156 pid 28523: current throughput 697.3, latency p50=0.082 p90=0.154 pid 28523: current throughput 702.8, latency p50=0.081 p90=0.149 pid 28523: current throughput 699.1, latency p50=0.085 p90=0.147 pid 28523: current throughput 703.8, latency p50=0.083 p90=0.148 pid 28523: current throughput 699.3, latency p50=0.083 p90=0.148 ... **Congratulations!** By now you should have successfully served a batched model over TorchServe. You can now shut down TorchServe. ..
literalinclude:: tutorial_source_instructions/run_torchserve_u20.sh :language: bash :lines: 32 ================================================ FILE: archive/torch-neuron/tutorials/tutorial_source_instructions/run_libtorch.sh ================================================ #!/bin/bash set -eExuo # Run the setup script cd aws-neuron-sdk/src/examples/pytorch sudo apt install -y cargo cd libtorch_demo chmod +x setup.sh && ./setup.sh # Run sanity checks ./run_tests.sh bert_neuron_b6.pt # Benchmark ./example-app bert_neuron_b6.pt ================================================ FILE: archive/torch-neuron/tutorials/tutorial_source_instructions/run_torchserve_u20.sh ================================================ #!/bin/bash set -eExuo cd aws-neuron-sdk/src/examples/pytorch cd torchserve python trace_bert_neuronx.py ls pip install transformers==4.52.* torchserve==0.7.0 torch-model-archiver==0.7.0 captum==0.6.0 sudo apt install openjdk-11-jdk -y mkdir model_store MAX_LENGTH=$(jq '.max_length' config.json) BATCH_SIZE=$(jq '.batch_size' config.json) MODEL_NAME=bert-max_length$MAX_LENGTH-batch_size$BATCH_SIZE torch-model-archiver --model-name "$MODEL_NAME" --version 1.0 --serialized-file ./bert_neuron_b6.pt --handler "./handler_bert_neuronx.py" --extra-files "./config.json" --export-path model_store ls model_store torchserve --start --ncs --model-store model_store --ts-config torchserve.config >torchserve.log 2>&1 sleep 10 curl http://127.0.0.1:8080/ping MAX_BATCH_DELAY=5000 # ms timeout before a partial batch is processed INITIAL_WORKERS=2 # Number from table above curl -X POST "http://localhost:8081/models?url=$MODEL_NAME.mar&batch_size=$BATCH_SIZE&initial_workers=$INITIAL_WORKERS&max_batch_delay=$MAX_BATCH_DELAY" python infer_bert.py python benchmark_bert.py torchserve --stop ================================================ FILE: archive/torch-neuron/tutorials/tutorials-inference-torch-neuron.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Tutorials for Inference with torch-neuron (Inf1) ==================================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. .. toctree:: :maxdepth: 1 :hidden: Computer Vision Tutorials Natural Language Processing (NLP) Tutorials Utilizing Neuron Capabilities Tutorials .. include:: /archive/torch-neuron/tutorials/tutorials-inference-torch-neuron.txt ================================================ FILE: archive/torch-neuron/tutorials/tutorials-inference-torch-neuron.txt ================================================ .. tab-set:: .. tab-item:: Computer Vision Tutorials * ResNet-50 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * PyTorch YOLOv4 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` .. tab-item:: Natural Language Processing (NLP) Tutorials * HuggingFace pretrained BERT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * HuggingFace pretrained BERT tutorial with shared weights :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * Bring your own HuggingFace pretrained BERT container to Sagemaker Tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * LibTorch C++ tutorial :ref:`[html] ` * TorchServe tutorial :ref:`[html] ` * HuggingFace MarianMT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` ..
tab-item:: Utilizing Neuron Capabilities Tutorials * BERT TorchServe tutorial :ref:`[html] ` * NeuronCore Pipeline tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` .. note:: To use Jupyter Notebook see: * :ref:`setup-jupyter-notebook-steps-troubleshooting` * :ref:`running-jupyter-notebook-as-script` ================================================ FILE: archive/torch-neuron/tutorials/tutorials-torch-neuron-computervision.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Computer Vision Tutorials (``torch-neuron``) ============================================ .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * ResNet-50 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * PyTorch YOLOv4 tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` ================================================ FILE: archive/torch-neuron/tutorials/tutorials-torch-neuron-nlp.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Natural Language Processing (NLP) Tutorials (``torch-neuron``) ============================================================== .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * HuggingFace pretrained BERT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * HuggingFace pretrained BERT tutorial with shared weights :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * Bring your own HuggingFace pretrained BERT container to Sagemaker Tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` * LibTorch C++ tutorial :ref:`[html] ` * TorchServe tutorial :ref:`[html] ` * HuggingFace MarianMT tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` .. toctree:: :hidden: :maxdepth: 1 /src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb /src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert_shared_weights.ipynb /src/examples/pytorch/byoc_sm_bert_tutorial/sagemaker_container_neuron.ipynb tutorial-libtorch tutorial-torchserve transformers-marianmt ================================================ FILE: archive/torch-neuron/tutorials/tutorials-utilizing-neuron-capabilities.rst ================================================ .. meta:: :noindex: :nofollow: :description: This content is archived and no longer maintained. :date-modified: 2026-03-11 Utilizing Neuron Capabilities Tutorials ======================================= .. warning:: This document is archived. torch-neuron (Inf1) is no longer officially supported by the AWS Neuron SDK. It is provided for reference only. For current framework support, see :doc:`/frameworks/index`. * BERT TorchServe tutorial :ref:`[html] ` * NeuronCore Pipeline tutorial :ref:`[html] ` :pytorch-neuron-src:`[notebook] ` .. 
toctree:: :hidden: tutorial-torchserve /src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.ipynb ================================================ FILE: archive/transformers-neuronx/api-reference-guide.rst ================================================ ================================================ FILE: archive/transformers-neuronx/api-reference-guide.txt ================================================ ================================================ FILE: archive/transformers-neuronx/developer-guide.rst ================================================ .. _tn_developer_guide: .. meta:: :noindex: :nofollow: :description: This topic is currently archived and not maintained. It is provided for reference only. Transformers Neuron Developer Guide (``transformers-neuronx``) ============================================================== .. toctree:: :maxdepth: 1 :hidden: /archive/transformers-neuronx/transformers-neuronx-developer-guide /archive/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching .. include:: /libraries/transformers-neuronx/developer-guide.txt ================================================ FILE: archive/transformers-neuronx/developer-guide.txt ================================================ * :ref:`transformers_neuronx_developer_guide` ================================================ FILE: archive/transformers-neuronx/index.rst ================================================ .. _transformers_neuronx_archive_readme: .. meta:: :noindex: :nofollow: :description: This topic is currently archived and not maintained. It is provided for reference only. Transformers NeuronX (``transformers-neuronx``) ============================================== .. toctree:: :maxdepth: 1 :hidden: Setup Developer Guide Tutorials Misc .. include:: /archive/transformers-neuronx/transformers-neuronx.txt ================================================ FILE: archive/transformers-neuronx/setup/index.rst ================================================ .. _transformers-neuronx-setup: .. meta:: :noindex: :nofollow: :description: This topic is currently archived and not maintained. It is provided for reference only. Transformers NeuronX Setup (``transformers-neuronx``) ===================================================== If you have already set up your environment to run PyTorch NeuronX, you just need to install the Transformers NeuronX library using the following instruction. .. code-block:: pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com If you are starting from scratch, the Neuron Multi Framework DLAMI is recommended, as it comes pre-installed with a Transformers NeuronX virtual environment. You can refer to the :ref:`instructions to launch a Neuron instance using Multi Framework DLAMI `. ================================================ FILE: archive/transformers-neuronx/transformers-neuronx-api-reference.rst ================================================ ================================================ FILE: archive/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.rst ================================================ .. _transformers_neuronx_developer_guide_for_cb: .. meta:: :noindex: :nofollow: :description: This topic is currently archived and not maintained. It is provided for reference only.
Transformers NeuronX (``transformers-neuronx``) Developer Guide for Continuous Batching ======================================================================================= Transformers NeuronX is integrated with vLLM to enable continuous batching for high-throughput LLM serving and inference. This guide aims to help users get started with continuous batching for Transformers NeuronX and vLLM by providing: - :ref:`Transformers NeuronX ` An overview of Transformers NeuronX. - :ref:`cb-overview` The continuous batching procedure implemented by Transformers NeuronX and vLLM. - :ref:`cb-install` Installation and usage instructions for Transformers NeuronX and vLLM. - :ref:`cb-release-221-features` A showcase of new features in Transformers NeuronX and vLLM. - :ref:`cb-faq` .. _cb-tnx-overview: Transformers NeuronX (``transformers-neuronx``) ----------------------------------------------- Transformers NeuronX for Trn1 and Inf2 is a software package that enables PyTorch users to perform large language model (LLM) :ref:`performant inference ` on second-generation Neuron hardware (See: :ref:`NeuronCore-v2 `). The :ref:`Neuron performance page ` lists expected inference performance for commonly used Large Language Models. .. _cb-overview: Continuous Batching with Transformers NeuronX and vLLM ------------------------------------------------------ Transformers NeuronX implements the following operational flow with vLLM for continuous batching support: 1. Context encode multiple prompts using virtual dynamic batching. 2. Decode all sequences simultaneously until a sequence generates an EOS token. 3. Evict the finished sequence and insert a new prompt encoding. 4. Resume the decoding process, repeating steps 2 and 3 until all sequences are decoded. .. _cb-supported-model-architectures: Supported Model Architectures ----------------------------- Transformers NeuronX supports continuous batching for models compatible with the following Hugging Face classes: - ``LlamaForCausalLM`` - ``MistralForCausalLM`` .. _cb-install: Install vLLM and Get Started with Offline Inference --------------------------------------------------- Neuron maintains a fork of vLLM (v0.6.2) that contains the necessary changes to support inference with Transformers NeuronX. Neuron is working with the vLLM community to upstream these changes to make them available in a future version. Install vLLM ^^^^^^^^^^^^ First install ``neuronx-cc`` and the ``transformers-neuronx`` packages. Then install the vLLM fork from source: .. code-block:: bash git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git cd upstreaming-to-vllm pip install -r requirements-neuron.txt VLLM_TARGET_DEVICE="neuron" pip install -e . .. note:: Please note the vLLM ``pip`` package from PyPI is not compatible with Neuron. To work with Neuron, install vLLM from source as outlined above. .. note:: The currently supported version of PyTorch for Neuron installs ``triton`` version ``2.1.0``. This is incompatible with ``vllm >= 0.5.3``. You may see an error ``cannot import name 'default_dump_dir...``. To work around this, run ``pip install --upgrade triton==3.0.0`` after installing the vLLM wheel. If Neuron packages are detected correctly in the installation process, ``vllm-0.1.dev2830+g22c56ee.neuron216`` will be installed (the ``neuron`` version depends on the installed ``neuronx-cc`` version).
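As a quick sanity check after installation, you can confirm that the installed ``vllm`` package carries the ``neuron`` suffix. This is a minimal sketch; the exact version string will vary with your ``neuronx-cc`` install.

.. code-block:: python

    # Verify the Neuron fork of vLLM is the one installed
    import vllm

    print(vllm.__version__)  # e.g. 0.1.dev2830+g22c56ee.neuron216
    assert "neuron" in vllm.__version__, "PyPI vLLM detected; reinstall from the Neuron fork"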
Run Offline Batched Inference with Transformers NeuronX and vLLM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In the following example we demonstrate how to perform continuous batching with a Llama model. .. note:: Since Llama models are gated, please accept the Llama Community License Agreement and request access to the model. Then use a Hugging Face user access token to download the model. .. code-block:: python from vllm import LLM, SamplingParams # Sample prompts. prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Meta-Llama-3.1-8B-Instruct", max_num_seqs=8, # The max_model_len and block_size arguments are required to be same as max sequence length, # when targeting neuron device. Currently, this is a known limitation in continuous batching # support in transformers-neuronx. max_model_len=128, block_size=128, # The device can be automatically detected when AWS Neuron SDK is installed. # The device argument can be either unspecified for automated detection, or explicitly assigned. device="neuron", tensor_parallel_size=2) # Generate texts from the prompts. The output is a list of RequestOutput objects # that contain the prompt, generated text, and other information. outputs = llm.generate(prompts, sampling_params) # Print the outputs. for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") Run the API Server ^^^^^^^^^^^^^^^^^^ To run the OpenAI-compatible API server in vLLM, run either command below: .. code-block:: bash vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 32 --max-num-seqs 4 --max-model-len 2048 --block-size 8 .. code-block:: bash python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 32 --max-num-seqs 4 --max-model-len 2048 --block-size 8 .. _cb-release-221-features: New Features in Neuron Release 2.21 ----------------------------------- Neuron's vLLM integration with Transformers NeuronX is tested using a public fork of vLLM v0.6.2. New features and enhancements introduced in this fork will be described below. Neuron's intent is to upstream these features to vLLM as soon as possible after release. Prior to upstreaming, these features can be accessed in the AWS Neuron GitHub repository https://github.com/aws-neuron/upstreaming-to-vllm/tree/v0.6.x-neuron. **Neuron Release 2.21 Features for the v0.6.2 vLLM Neuron Fork** - :ref:`Sequence bucketing ` configuration for context encoding and token generation. - :ref:`Granular NeuronConfig control ` in vLLM entrypoints. - Inference support for :ref:`speculative decoding `. - Inference support for :ref:`EAGLE speculative decoding `. **Neuron Release 2.20 Features** - Multi-node inference support for larger models. Example scripts are included in `vLLM `_ . - Direct loading of Hugging Face-compatible checkpoints without creation of a ``-split`` directory. .. _cb-sequence-bucketing: Sequence Bucketing ^^^^^^^^^^^^^^^^^^ To configure buckets, set the following environment variables. Refer to the `developer guide `_ for details on how to configure the values. These environment variables need to be set before starting the vLLM server or instantiating the ``LLM`` object. 
- ``NEURON_CONTEXT_LENGTH_BUCKETS``: Bucket sizes for context encoding. - ``NEURON_TOKEN_GEN_BUCKETS``: Bucket sizes for token generation. For example: ``export NEURON_CONTEXT_LENGTH_BUCKETS="128,512,1024"`` .. _cb-neuron-config-override: NeuronConfig Override ^^^^^^^^^^^^^^^^^^^^^ The default ``NeuronConfig`` in vLLM uses the latest optimizations from the Neuron SDK. However, you can override the default values or add a new configuration from the `developer guide `_ by setting the ``override_neuron_config`` parameter while creating the ``LLM`` object. .. code-block:: python llm = LLM( model="meta-llama/Meta-Llama-3.1-8B-Instruct", max_num_seqs=8, max_model_len=128, block_size=128, device="neuron", tensor_parallel_size=32, # Override or update the NeuronConfig override_neuron_config={"shard_over_sequence":True}) While standing up the API server, set the ``override-neuron-config`` argument. For example: .. code-block:: bash python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 32 --max-num-seqs 4 --max-model-len 2048 --block-size 8 --override-neuron-config {\"shard_over_sequence\":\"True\"} .. _cb-quantization: Quantization ^^^^^^^^^^^^ To use `int8 weight storage `_, set the environment variable ``NEURON_QUANT_DTYPE`` to ``s8``. .. _cb-speculative-decoding: Speculative Decoding ^^^^^^^^^^^^^^^^^^^^ Speculative decoding is a token generation optimization technique that uses a small draft model to generate ``K`` tokens autoregressively and a larger target model to determine which draft tokens to accept, all in a combined forward pass. For more information on speculative decoding, please see `[Leviathan, 2023] `_ and `[Chen et al., 2023] `_. Speculative decoding is now available for inference with Transformers NeuronX and vLLM: .. code-block:: python from vllm import LLM, SamplingParams # Sample prompts. prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Meta-Llama-3.1-70B-Instruct", speculative_model="meta-llama/Llama-3.2-1B-Instruct", # The max_model_len, speculative_max_model_len, and block_size arguments are required to be same as max sequence length, # when targeting neuron device. Currently, this is a known limitation in continuous batching # support in transformers-neuronx. max_model_len=128, block_size=128, speculative_max_model_len=128, dtype="bfloat16", max_num_seqs=4, num_speculative_tokens=4, # The device can be automatically detected when AWS Neuron SDK is installed. # The device argument can be either unspecified for automated detection, or explicitly assigned. device="neuron", tensor_parallel_size=32, use_v2_block_manager=True, ) outputs = llm.generate(prompts, sampling_params) # Print the outputs. for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") .. note:: Please ensure that the selected target and draft model are from the same model family. For example, if the target model is an instruction-tuned Llama model, the draft model must also be a lower-capacity instruction-tuned Llama model. ..
_cb-eagle-speculative-decoding: EAGLE Speculative Decoding ^^^^^^^^^^^^^^^^^^^^^^^^^^ Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE) extends the speculative decoding technique described above by: - Utilizing a specially trained EAGLE draft model that predicts feature outputs through an Autoregression Head and next token outputs through an LM Head. - Reducing sampling uncertainty by using the next autoregressively sampled token and a current feature map as draft model inputs. For more information on EAGLE, please see `[Li et al., 2024] `_. EAGLE speculative decoding can be applied without changes to the speculative decoding code sample above. Transformers NeuronX and vLLM will recognize a draft model as an EAGLE draft when ``is_eagle: True`` is set in the model's Hugging Face ``config.json`` file. .. _cb-faq: Frequently Asked Questions -------------------------- **Is PagedAttention supported in the vLLM integration?** No, PagedAttention is not currently supported. It will be supported in a future Neuron release. ================================================ FILE: archive/transformers-neuronx/transformers-neuronx-developer-guide.rst ================================================ .. _transformers_neuronx_developer_guide: .. meta:: :noindex: :nofollow: :description: This topic is currently archived and not maintained. It is provided for reference only. Transformers NeuronX (``transformers-neuronx``) Developer Guide ================================================================ Transformers NeuronX for Trn1 and Inf2 is a software package that enables PyTorch users to perform large language model (LLM) :ref:`performant inference ` on second-generation Neuron hardware (See: :ref:`NeuronCore-v2 `). The :ref:`Neuron performance page ` lists expected inference performance for commonly used Large Language Models. Introduction ------------ The `Transformers NeuronX repository `_ contains the source code of the AWS Neuron Transformers integration project. As it stands now, it mainly serves the purpose of running transformer decoder inference (autoregressive sampling) workflows on the Neuron platform. Note: This project is **actively** in development. The Neuron team is still heavily modifying the Neuron optimized module classes. The functionality provided in this repository will not maintain long-term API stability until version >= 1.0.0. For applications willing to reuse code from this repository, we recommend treating the Neuron optimized module implementations as samples, and pinning the version of the main library package ``torch-neuronx`` to avoid breaking interface changes as new features are developed. Checkpoint compatibility with HuggingFace Transformers ------------------------------------------------------ ``transformers-neuronx`` is checkpoint-compatible with HuggingFace Transformers. While the Neuron team reimplemented some HuggingFace Transformers models from scratch for the purpose of maximizing the execution efficiency of transformer decoders on Neuron, the implementations are done with maximizing compatibility in mind, meaning one can train transformer decoder models, say GPT2, using the standard HuggingFace Transformers library, and then construct an inference-optimized decoder model using transformers-neuronx's ``GPT2ForSampling`` class, as sketched below.
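For illustration, here is a minimal sketch of that round trip, assuming a standard ``gpt2`` checkpoint; the compile-time arguments shown are example values described later in this guide.

.. code-block:: python

    from transformers import GPT2LMHeadModel
    from transformers_neuronx import GPT2ForSampling

    # Save a standard Hugging Face Transformers checkpoint (e.g. after training)
    GPT2LMHeadModel.from_pretrained('gpt2').save_pretrained('gpt2-checkpoint')

    # Construct the inference-optimized Neuron decoder from the same checkpoint
    model = GPT2ForSampling.from_pretrained(
        'gpt2-checkpoint',
        batch_size=1,     # example compile-time settings; see
        tp_degree=2,      # "Compile-time Configurations" below
        n_positions=128,
        amp='f16',
    )
    model.to_neuron()  # load the weights and compile for NeuronCores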
If training was done with other libraries such as MegatronLM, then it is still possible to convert the obtained checkpoint to the standard HuggingFace Transformers checkpoint format, and then move on to transformers-neuronx's optimized decoder implementations. Neuron optimized transformer decoders implemented in XLA High Level Operations (HLO) ------------------------------------------------------------------------------------ Due to the stateful nature of the autoregressive sampling computation, an efficient implementation of autoregressive sampling using the Neuron SDK requires rewriting the model forward function into a pure-function computation running on fixed-shape tensors. Furthermore, we want the pure-function computation to be implemented in a compiled language so that the Neuron compiler can perform extensive code analysis and optimization. We chose XLA High Level Operations (HLO) as the compiled language for implementing Neuron optimized transformer decoder classes. The source code of these classes contains Python functions written in a syntax called "PyHLO", the name of a Neuron internal tool for writing/compiling the HLO language in Python. As an example, a "language model head" implemented in PyHLO may look like the following. :: class LmHeadHlo: ... def lm_head(self, scribe): dtype = self.dtype hidden_size = self.hidden_size n_active_tokens = self.n_active_tokens batch_size = self.batch_size vocab_size = self.vocab_size hidden = dtype[hidden_size, n_active_tokens, batch_size].Parameter(parameter_number=0) weight = dtype[hidden_size, vocab_size].Parameter(parameter_number=1) rhs_size = n_active_tokens * batch_size hidden = dtype[hidden_size, rhs_size].Reshape(hidden) dot_dims = dict(lhs_contracting_dimensions=[0], rhs_contracting_dimensions=[0]) logits = dtype[vocab_size, rhs_size].Dot(weight, hidden, dot_dimension_numbers=dot_dims) return dtype[vocab_size, n_active_tokens, batch_size].Reshape(logits) ... The ``transformers_neuronx.compiler.compile_py_func`` function can convert the Python ``lm_head`` function into ``HloModuleProto``, a valid input format for the ``neuronx-cc`` compiler. Tensor-parallelism support -------------------------- For transformer decoders used in large language models, tensor-parallelism is necessary as it provides a way to shard the models' large weight matrices onto multiple NeuronCores and to have NeuronCores work collaboratively on the same matrix-multiply operation. transformers-neuronx's tensor-parallelism support makes heavy use of collective operations such as all-reduce, which is supported natively by the Neuron runtime. There are some principles for setting the tensor-parallelism degree (the number of NeuronCores participating in sharded matrix multiply operations) for Neuron-optimized transformer decoder models. 1. The number of attention heads needs to be divisible by the tensor-parallelism degree. 2. The total data size of model weights and key-value caches needs to be smaller than 16 GB times the tensor-parallelism degree. 3. Currently, the Neuron runtime supports tensor-parallelism degrees 1, 2, 8, and 32 on Trn1 and supports tensor-parallelism degrees 1, 2, 4, 8, and 24 on Inf2. Some examples: 1. ``facebook/opt-13b`` has 40 attention heads, and when running at batch size 1 and float16 precision the model requires ~29 GB of memory; therefore, a ``trn1.2xlarge`` with 32 GB device memory is sufficient. 2.
``facebook/opt-30b`` has 56 attention heads, and at batch size 1 and float16 precision the model requires ~66 GB of memory; therefore, it can run on 8 NeuronCores on one ``trn1.32xlarge`` using 128 GB device memory. 3. ``gpt2-xl`` has 25 attention heads and requires ~4 GB of memory at bfloat16 precision. It can only run without tensor parallelism (degree 1), since 25 is not divisible by any larger supported degree. Features -------- Compile-time Configurations --------------------------- Transformers Neuron models support a variety of compile-time configurations that can be used to tune model performance. All models support the following configurations: - ``batch_size``: The batch size to compile a model for. Once the batch size has been set, this is the only size that is supported at inference time. Neuron uses ahead-of-time compilation to achieve high performance, which requires that the compiled artifact shapes be known at compilation time. - ``n_positions``: The maximum number of positions (or sequence length) to allow during generation. This parameter directly controls the width of the KV cache. This parameter should be set to the maximum expected sequence length for the end application. - ``tp_degree``: This parameter controls the number of tensor parallel shards to split the model into. Each shard will execute on a separate NeuronCore. To minimize latency, it is recommended to set the tensor parallelism to be equal to the number of NeuronCores that are available on an instance. - ``amp``: This allows a model's weights and compute to be cast to a different type. The options are ``'bf16'``, ``'f16'``, or ``'f32'``. For models trained in ``float32``, the 16-bit mixed precision options (``'bf16'``, ``'f16'``) generally provide sufficient accuracy while significantly improving performance. - ``context_length_estimate``: This parameter controls the maximum sequence length of the prompt/context handling compute graph. This parameter is not supported in ``GPTNeoXForSampling`` and ``GPTJForSampling``. .. code-block:: python from transformers_neuronx import NeuronAutoModelForCausalLM model = NeuronAutoModelForCausalLM.from_pretrained( 'gpt2', # Uses the GPT2 checkpoint from https://huggingface.co/gpt2 batch_size=1, # Allow inference with batch size 1 inputs n_positions=128, # Allow a maximum size of 128 prompt & output tokens tp_degree=2, # Shard the model weights & compute across 2 NeuronCores amp='f16', # Downcast the weights & compute to float16 context_length_estimate=64, # Build an optimized context encoding network for a maximum prompt size of 64 ) model.to_neuron() # Load/compile the model Checkpoint support and automatic model selection ------------------------------------------------ *New in release 2.18* Transformers Neuron now supports a greater variety of checkpoints, including older PyTorch binary checkpoints and newer `safetensors`_ checkpoints. For improved load speed and reduced host memory consumption, it is recommended to always use ``safetensors`` by default. Both regular and sharded variants of checkpoints are supported. It is no longer recommended to use the ``save_pretrained_split`` function, which was used in older Transformers Neuron examples. In addition to supporting standard checkpoint formats, Transformers Neuron provides an AutoModel class ``NeuronAutoModelForCausalLM`` which can be used to load the correct model without explicitly importing the architecture-specific class. .. _safetensors: https://github.com/huggingface/safetensors ..
code-block:: python from transformers_neuronx import NeuronAutoModelForCausalLM # Loads: https://huggingface.co/bigscience/bloom-560m bloom = NeuronAutoModelForCausalLM.from_pretrained('bigscience/bloom-560m') bloom.to_neuron() # Loads: https://huggingface.co/openlm-research/open_llama_3b_v2 llama = NeuronAutoModelForCausalLM.from_pretrained('openlm-research/open_llama_3b_v2') llama.to_neuron() # This is equivalent to the following: from transformers_neuronx import BloomForSampling model = BloomForSampling.from_pretrained('bigscience/bloom-560m') model.to_neuron() from transformers_neuronx import LlamaForSampling llama = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b_v2') llama.to_neuron() .. note:: Advanced features of Hugging Face Hub access are not supported. This includes private repositories (which require access tokens) and branches. In order to support more advanced repository downloads, please download the model to a local directory and load it from there. Hugging Face generate() API support ----------------------------------- Transformers Neuron models support the Hugging Face `generate() `__ API via the ``HuggingFaceGenerationModelAdapter`` adapter class. In the following example we demonstrate how to run sampling with temperature using the ``GPT2`` model: .. code-block:: python import torch from transformers import AutoTokenizer, AutoConfig from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, HuggingFaceGenerationModelAdapter # Create and compile the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2') model.to_neuron() # Use the `HuggingFaceGenerationModelAdapter` to access the generate API config = AutoConfig.from_pretrained('gpt2') model = HuggingFaceGenerationModelAdapter(config, model) # Get a tokenizer and example input tokenizer = AutoTokenizer.from_pretrained('gpt2') tokenizer.pad_token_id = tokenizer.eos_token_id tokenizer.padding_side = 'left' text = "Hello, I'm a language model," encoded_input = tokenizer(text, return_tensors='pt', padding=True) # Run inference using temperature with torch.inference_mode(): model.reset_generation() generated_sequence = model.generate( input_ids=encoded_input.input_ids, attention_mask=encoded_input.attention_mask, do_sample=True, max_length=256, temperature=0.7, ) print([tokenizer.decode(tok) for tok in generated_sequence]) Note: As the Hugging Face generation API can expand the input's batch dimension based on different generation configurations, we need to compile the Neuron model with a compile-time ``batch_size`` that differs from the runtime ``batch_size`` (the batch dimension of the inputs to the generation API). - If ``do_sample=True``, ``compile_batch_size = runtime_batch_size x num_return_sequences x beam_size`` - Otherwise, ``compile_batch_size = runtime_batch_size x num_return_sequences`` Neuron Persistent Cache ------------------------ The Neuron Persistent Cache is now enabled for Transformers Neuron by default. Model artifacts that have been compiled once will be cached and reused on successive runs when possible. Model artifacts will only be reused when compiling with the same compiler version (neuronx-cc), model configurations, and compiler flags. It also includes other features (e.g., using an S3 bucket as the cache backend). For more detailed information, see the :ref:`Persistent cache documentation `. .. _int8_weight_storage_support: int8 weight storage support --------------------------- Transformers Neuron supports int8 weight storage for the ``GPT2`` model class.
int8 weight storage can be used to reduce memory bandwidth usage and improve model performance. int8 weight storage support for additional model classes will be added in an upcoming release. In the following example we demonstrate how to apply int8 weight storage to the ``GPT2`` model via the ``QuantizationConfig`` and ``NeuronConfig`` configs: .. code-block:: python import torch from transformers import AutoTokenizer from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, NeuronConfig, QuantizationConfig # Set the weight storage config to use int8 quantization and bf16 dequantization neuron_config = NeuronConfig( quant=QuantizationConfig(quant_dtype='s8', dequant_dtype='bf16'), ) # Create and compile the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained( 'gpt2', amp='bf16', # NOTE: When using quantization, amp type must match dequant type neuron_config=neuron_config ) model.to_neuron() # Get a tokenizer and example input tokenizer = AutoTokenizer.from_pretrained('gpt2') text = "Hello, I'm a language model," encoded_input = tokenizer(text, return_tensors='pt') # Run inference with torch.inference_mode(): generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None) print([tokenizer.decode(tok) for tok in generated_sequence]) Parallel Input Prompt Context Encoding -------------------------------------- Transformers Neuron supports parallel input prompt context encoding for the ``GPT2`` model class. Parallel context encoding can be used to significantly reduce the latency of the input prompt context encoding before the autoregressive decoder token generation loop. Parallel context encoding support for additional model classes will be added in an upcoming release. The ``GPT2ForSamplingWithContextBroadcasting`` class has a ``context_length_estimate`` variable that determines the number of input prompt tokens that will be processed in parallel. For optimal results, this should be set to a power of 2 that is closest to the most frequently seen input prompt length. In the following example we demonstrate how to apply parallel context encoding to the ``GPT2`` model via the ``GPT2ForSamplingWithContextBroadcasting`` class. In this example, we set ``context_length_estimate`` to 256, a power of 2 that covers the length of the input prompt (97 tokens). .. code-block:: python import torch from transformers import AutoTokenizer from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting # Create and compile the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained( 'gpt2', context_length_estimate=256 # Create an optimized network which handles prompts up to 256 tokens ) model.to_neuron() # Get a tokenizer and example input tokenizer = AutoTokenizer.from_pretrained('gpt2') text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business.
" encoded_input = tokenizer(text, return_tensors='pt') # Run inference with torch.inference_mode(): generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256) print([tokenizer.decode(tok) for tok in generated_sequence]) The ``GPT2ForSamplingWithContextBroadcasting`` class can also process an input prompt that has a different batch size from the batch size of the autoregressive decoder output. For example, an input prompt with batch size = 1 can be used to produce an output of batch size = 5 to generate multiple suggestions for the same input prompt. The input prompt batch size can be specified using the ``prompt_batch_size`` argument and the autoregressive decoder output batch size can be specified using the ``batch_size`` argument. In the following example we demonstrate how to apply parallel context encoding to the ``GPT2`` model to generate 5 outputs for a single input. .. code-block:: python import torch from transformers import AutoTokenizer from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting # Create and compile the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained( 'gpt2', prompt_batch_size=1, # This allows prompt and output batch to vary batch_size=5, context_length_estimate=256 ) model.to_neuron() # Get a tokenizer and example input tokenizer = AutoTokenizer.from_pretrained('gpt2') text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. " encoded_input = tokenizer(text, return_tensors='pt') # Run inference with torch.inference_mode(): generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256) for i, output in enumerate(generated_sequence): print('-' * 50) print(f'Batch {i} output:') print(tokenizer.decode(output)) Serialization support --------------------- Transformers NeuronX supports model serialization (model saving and loading) for all models except the ``GPTJForSampling`` and ``GPTNeoXForSampling``` model classes. In the following example we demonstrate how to save and load the compiled artifacts for the ``GPT2`` model: .. code-block:: python import torch from transformers import AutoTokenizer from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting # Create and compile the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2') model.to_neuron() # Save the compiled Neuron model model.save('gpt2-compiled-artifacts') # Load the Neuron model model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2') # Load the compiled Neuron artifacts model.load('gpt2-compiled-artifacts') # Since prior artifacts are loaded, this skips compilation model.to_neuron() # Get a tokenizer and example input tokenizer = AutoTokenizer.from_pretrained('gpt2') text = "Hello, I'm a language model," encoded_input = tokenizer(text, return_tensors='pt') # Run inference with torch.inference_mode(): generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None) print([tokenizer.decode(tok) for tok in generated_sequence]) Transformers NeuronX also supports the serialization of presharded weights. 
This reduces future model load time by saving a transformed and sharded set of weights as a new safetensors checkpoint. When this checkpoint is loaded, the sharding and transformations normally done by Transformers NeuronX are skipped, reducing model load time significantly. Saving presharded weights is only available when ``on_device_embedding`` is enabled. In the following example we demonstrate how to save and load presharded weights along with compiled artifacts for a Llama model:

.. code-block:: python

    from transformers_neuronx import LlamaForSampling
    from transformers_neuronx import NeuronConfig
    from transformers import AutoTokenizer

    neuron_config = NeuronConfig(on_device_embedding=True)

    # Create and compile the Neuron model
    model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
    model_neuron.to_neuron()

    # Save the presharded weights and compiled artifacts to a directory
    model_neuron.save('llama-artifacts', sharded_weights=True)
    del model_neuron

    # Use the presharded checkpoint to reduce model load time
    model_neuron_presharded = LlamaForSampling.from_pretrained('llama-artifacts', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)

    # Load the compiled artifacts to skip compilation
    model_neuron_presharded.load('llama-artifacts')
    model_neuron_presharded.to_neuron()

CPU Compilation Support
-----------------------

Transformers NeuronX supports compilation on CPU. CPU compilation is compatible with model serialization and weight presharding, and is available for all models except the ``GPTJForSampling`` and ``GPTNeoXForSampling`` model classes. To compile on CPU, replace the initial call to ``to_neuron()`` with ``cpu_compile()``. In the following example we demonstrate how to compile the Llama model on CPU:

.. code-block:: python

    from transformers_neuronx import LlamaForSampling
    from transformers_neuronx import NeuronConfig
    from transformers import AutoTokenizer

    neuron_config = NeuronConfig(on_device_embedding=True)

    # Create and compile the model on CPU
    model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
    model_neuron.cpu_compile()  # instead of model_neuron.to_neuron()

    # Save the weights and compiled artifacts to a directory
    model_neuron.save('llama-artifacts')

To use the saved artifacts generated by CPU compilation on a Neuron device:

.. code-block:: python

    from transformers_neuronx import LlamaForSampling
    from transformers_neuronx import NeuronConfig
    from transformers import AutoTokenizer

    neuron_config = NeuronConfig(on_device_embedding=True)

    # Use the presharded checkpoint to reduce model load time
    model_neuron_presharded = LlamaForSampling.from_pretrained('llama-artifacts', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)

    # Load the compiled artifacts to skip compilation
    model_neuron_presharded.load('llama-artifacts')

    # Now use the CPU-compiled artifacts to run the model
    model_neuron_presharded.to_neuron()

Compilation worker count support
--------------------------------

Transformers NeuronX supports configuring the compilation worker count for all models. This setting controls how many workers execute HLO graph compilation tasks in parallel. A lower setting reduces CPU memory utilization when compiling a model, but increases compilation time. This setting is useful to prevent out-of-CPU-memory errors when compiling large models.
By default, the number of workers is equal to the total number of HLO graphs required for compilation. The compilation worker count applies to both the CPU compilation flow (``cpu_compile()``) and the Neuron device compilation flow (``to_neuron()``). To set the compilation worker count, use the ``compilation_worker_count`` argument in ``NeuronConfig``. The following sample shows how to compile the graphs one at a time:

.. code-block:: python

    neuron_config = NeuronConfig(compilation_worker_count=1)

Grouped-query attention (GQA) support [Beta]
--------------------------------------------

Transformers Neuron supports grouped-query attention (GQA) models for the ``Llama`` and ``Mistral`` model classes. Multiple sharding strategies for the K/V cache are available to satisfy different constraints:

- ``GQA.SHARD_OVER_HEADS`` distributes the K/V caches along the head dimension. It can only be used when the number of K/V heads is a multiple of the tensor-parallelism degree. This is the default configuration.
- ``GQA.SHARD_OVER_BATCH`` distributes the K/V caches along the batch dimension. It can only be used when the batch size is a multiple of the tensor-parallelism degree. This can be useful for large-batch inference.
- ``GQA.REPLICATED_HEADS`` replicates the K/V heads. It can be used when neither the batch size nor the number of K/V heads is divisible by the tensor-parallelism degree. This can be useful for low-latency, small-batch inference.
- ``GQA.ALL_GATHER_HEADS`` evenly splits the K/V heads across all NeuronCores. This is optimized for large-batch inference of GQA models without replication.

.. _mistral_gqa_code_sample:

In the following example we demonstrate how to configure these distributed inference strategies and perform inference with the ``Mistral`` model:

.. code-block:: python

    import torch
    from transformers import AutoTokenizer
    from transformers_neuronx import MistralForSampling, GQA, NeuronConfig

    # Set the GQA sharding strategy to shard over heads
    neuron_config = NeuronConfig(
        group_query_attention=GQA.SHARD_OVER_HEADS
    )

    # Create and compile the Neuron model
    model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16', neuron_config=neuron_config)
    model_neuron.to_neuron()

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
    text = "[INST] What is your favourite condiment? [/INST]"
    encoded_input = tokenizer(text, return_tensors='pt')

    # Run inference
    with torch.inference_mode():
        generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
    print([tokenizer.decode(tok) for tok in generated_sequence])

Repeated Ngram Filtering
------------------------

Repeated Ngram Filtering reduces redundant n-gram phrases within the generated text. It uses the same API as the `HuggingFace API for NoRepeatedNGram `__. Set the parameter ``no_repeat_ngram_size`` to the size of the n-gram phrases to be filtered and pass it to the sampling function, as in ``model.sample(input_ids, no_repeat_ngram_size=3)``.
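For instance, extending the ``Mistral`` example above (reusing ``model_neuron``, ``tokenizer``, and ``encoded_input``), repeated 3-grams can be filtered during sampling:

.. code-block:: python

    # Filter any 3-gram that would otherwise repeat within the generated text
    with torch.inference_mode():
        generated_sequence = model_neuron.sample(
            encoded_input.input_ids,
            sequence_length=256,
            no_repeat_ngram_size=3,
        )
    print([tokenizer.decode(tok) for tok in generated_sequence])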
On-device sampling support [Beta]
---------------------------------

Transformers NeuronX supports on-device sampling for all models except Mixtral models. The feature can be enabled by setting ``on_device_generation`` in ``NeuronConfig`` to an instance of ``GenerationConfig``. The example in the following section demonstrates how to use on-device generation for a ``Llama`` model using ``top_k``, ``top_p``, ``top_p_min_tokens`` and ``temperature``.

Top-K on-device sampling support [Beta]
---------------------------------------

Transformers Neuron supports Top-K sampling on-device for all models except Mixtral models. In the following example, we demonstrate how to use on-device Top-K for the ``Llama`` model via the ``GenerationConfig`` and ``NeuronConfig`` configs:

.. code-block:: python

    import torch
    from transformers_neuronx import LlamaForSampling
    from transformers_neuronx.config import NeuronConfig, GenerationConfig
    from transformers import AutoTokenizer

    neuron_config = NeuronConfig(
        on_device_generation=GenerationConfig(max_length=128, top_k=10, top_p=0.9, top_p_min_tokens=1, temperature=0.9, do_sample=True)
    )

    # Create and compile the Neuron model
    model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
    model_neuron.to_neuron()

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b')
    text = "Hello, I'm a language model,"
    encoded_input = tokenizer(text, return_tensors='pt')

    # Run inference
    with torch.inference_mode():
        generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128, top_k=10)
    print([tokenizer.decode(tok) for tok in generated_sequence])

By default, transformers-neuronx uses the same fixed sampling parameters for all sequences across all invocations of the model when on-device generation is enabled. New sampling parameters can be supplied per model invocation by enabling the ``dynamic`` feature in the ``GenerationConfig``. Different sampling parameters can also be supplied for each sequence in the batch by using the ``per_batch_line`` feature. When using this feature, it is recommended to limit the number of tokens considered during sampling across all sequences by setting ``global_top_k`` to a reasonably low number (e.g. 250); this prevents poor performance when computing ``top_p`` tokens over a large vocabulary without any prior filtering. When using ``per_batch_line``, the ``top_k``, ``top_p``, ``top_p_min_tokens`` and ``temperature`` parameters accept lists with one value per sequence in the batch. In the following example, we demonstrate how to use the ``dynamic`` and ``per_batch_line`` features together:
.. code-block:: python

    import torch
    from transformers_neuronx import LlamaForSampling
    from transformers_neuronx.config import NeuronConfig, GenerationConfig
    from transformers import AutoTokenizer

    batch_size = 2

    generation_config = GenerationConfig(
        max_length=128,
        dynamic=True,
        per_batch_line=True,
        do_sample=True,
        top_k=[1] * batch_size,
        top_p=[1.0] * batch_size,
        top_p_min_tokens=[1] * batch_size,
        temperature=[1.0] * batch_size,
        global_top_k=256
    )

    neuron_config = NeuronConfig(
        on_device_generation=generation_config
    )

    # Create and compile the Neuron model
    model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=2, tp_degree=8, n_positions=128, neuron_config=neuron_config)
    model_neuron.to_neuron()

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b')
    tokenizer.pad_token = tokenizer.eos_token
    text = ["Hello, I'm a language model,", "Hello, I'm also a language model,"]
    encoded_input = tokenizer(text, return_tensors='pt', padding=True)

    # Run inference
    with torch.inference_mode():
        generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128)
    print([tokenizer.decode(tok) for tok in generated_sequence])

    # Use different settings for each sequence in the batch
    # Supported because we use `generation_config.per_batch_line = True`
    generation_config.top_k = [1, 20]
    generation_config.top_p = [1.0, 0.9]
    generation_config.top_p_min_tokens = [1, 1]
    generation_config.temperature = [1.0, 0.9]

    # Update the generation configuration dynamically
    # Supported because we use `generation_config.dynamic = True`
    model_neuron.update_generation_config(generation_config)

    with torch.inference_mode():
        generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128)
    print([tokenizer.decode(tok) for tok in generated_sequence])

Running inference with multiple models
--------------------------------------

Multiple transformers-neuronx models can be loaded at the same time as long as the total number of consumed NeuronCores is less than or equal to the total number of NeuronCores on the instance. For example, three models with ``tp_degree=8`` can be loaded and run in parallel on an inf2.48xlarge, which has 24 NeuronCores. The ``NEURON_RT_NUM_CORES`` and ``NEURON_RT_VISIBLE_CORES`` environment variables can be used to allocate the necessary number of NeuronCores to each process so that multiple transformers-neuronx models can run in parallel. See the :ref:`torch_neuronx_core_placement_guide` section for additional information about how to use these environment variables.

Note that when multiple models are used on a single instance, the number of host threads should be reduced to avoid contention on the host side. Assume the Neuron instance (e.g. trn1) has 192 CPU cores: if one of the models keeps all CPU cores busy, the remaining models suffer significant performance degradation. As a result, the number of threads for each model should be limited to a share of the available cores by setting the ``OMP_NUM_THREADS`` environment variable. For example, if there are 192 CPU cores available and four ``tp_degree=8`` models are used, export ``OMP_NUM_THREADS=48`` for each process to avoid contention.
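As a hypothetical sketch (the core range, thread count, and model are illustrative, and the variables must be set before the Neuron runtime is initialized), each process can pin its own NeuronCores and host threads like this:

.. code-block:: python

    import os

    # Illustrative values: this process uses NeuronCores 0-7 (tp_degree=8) and
    # 48 host threads. Set these before any Neuron model is loaded; other
    # processes would use disjoint core ranges (e.g. '8-15', '16-23').
    os.environ['NEURON_RT_VISIBLE_CORES'] = '0-7'
    os.environ['OMP_NUM_THREADS'] = '48'

    from transformers_neuronx import NeuronAutoModelForCausalLM

    model = NeuronAutoModelForCausalLM.from_pretrained('gpt2', tp_degree=8)
    model.to_neuron()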
Streamer
--------

LLMs generate tokens in an auto-regressive loop. A ``model.sample`` call waits until the full sequence has been generated before returning the response. It is possible to emit each output token as soon as it is generated by using a streamer object. A streamer is an object with two methods: ``put`` and ``end``. The transformers library provides several predefined streamers, such as ``TextIteratorStreamer``. The following example shows how to define a streamer and use it in transformers-neuronx:

.. code-block:: python

    import torch
    from transformers import AutoTokenizer
    from transformers_neuronx import MistralForSampling, GQA
    import transformers
    from time import time

    # Create a custom streamer inherited from transformers.generation.streamers.BaseStreamer
    class CustomStreamer(transformers.generation.streamers.BaseStreamer):

        def __init__(self) -> None:
            self.reset()

        def reset(self):
            self.token_latencies = []
            self.iter = 0
            self.now = time()

        def put(self, tokens):
            now = time()
            token_latency = now - self.now
            print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
            self.now = now
            self.iter += 1
            self.token_latencies.append(token_latency)

        def end(self):
            print("First 10 token latencies:", self.token_latencies[:10])

    # Create and compile the Neuron model
    model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
    model_neuron.to_neuron()

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
    text = "[INST] What is your favourite condiment? [/INST]"
    encoded_input = tokenizer(text, return_tensors='pt')

    streamer = CustomStreamer()

    # Run inference
    with torch.inference_mode():
        generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None, streamer=streamer)
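Since any object implementing ``put`` and ``end`` works here, the predefined streamers from the transformers library can plug in the same way. As a sketch (reusing ``model_neuron``, ``tokenizer`` and ``encoded_input`` from above; an untested pairing shown for illustration only), ``TextIteratorStreamer`` lets you consume decoded text as it is produced:

.. code-block:: python

    from threading import Thread
    from transformers import TextIteratorStreamer

    # TextIteratorStreamer implements put()/end() and buffers decoded text
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

    def generate():
        with torch.inference_mode():
            model_neuron.sample(encoded_input.input_ids, sequence_length=256, streamer=streamer)

    # Run sampling in a background thread and print text as it arrives
    Thread(target=generate).start()
    for text_chunk in streamer:
        print(text_chunk, end='', flush=True)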
Stopping Criteria
-----------------

We can define custom stopping criteria to stop the autoregressive loop early. For example, to stop generation after 0.5 seconds, we can define and use a stopping criteria class as follows:

.. code-block:: python

    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
    from transformers_neuronx import MistralForSampling, GQA, NeuronConfig
    from transformers_neuronx.stopping_criteria import StoppingCriteria, StoppingCriteriaList
    from time import time
    from typing import List, Optional, Callable

    class MaxTimeCriteria(StoppingCriteria):
        """
        This class can be used to stop generation whenever the full generation exceeds some amount of time.
        By default, the time starts being counted when you initialize this object.
        You can override this by passing an `initial_timestamp`.

        Args:
            max_time (`float`):
                The maximum allowed time in seconds for the generation.
            initial_timestamp (`float`, *optional*, defaults to `time()`):
                The start of the generation allowed time.
        """

        def __init__(self, max_time: float, initial_timestamp: Optional[float] = None):
            self.max_time = max_time
            self.initial_timestamp = time() if initial_timestamp is None else initial_timestamp

        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
            dt = time() - self.initial_timestamp
            end_condition = dt > self.max_time
            if end_condition:
                print("Stopping!")
            return end_condition

    # Create a streamer. This can also be a custom streamer inherited from transformers.generation.streamers.BaseStreamer
    class CustomStreamer(transformers.generation.streamers.BaseStreamer):

        def __init__(self) -> None:
            self.reset()

        def reset(self):
            self.token_latencies = []
            self.iter = 0
            self.now = time()

        def put(self, tokens):
            now = time()
            token_latency = now - self.now
            print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
            self.now = now
            self.iter += 1
            self.token_latencies.append(token_latency)

        def end(self):
            pass

    # Create and compile the Neuron model
    model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
    model_neuron.to_neuron()

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
    text = "[INST] What is your favourite condiment? [/INST]"
    encoded_input = tokenizer(text, return_tensors='pt')

    # Add stopping criteria to stop after 0.5 seconds
    stopping_criteria_list = StoppingCriteriaList([MaxTimeCriteria(0.5)])

    streamer = CustomStreamer()

    # Run inference
    with torch.inference_mode():
        model_neuron.sample(input_ids=encoded_input.input_ids, sequence_length=256, stopping_criteria_list=stopping_criteria_list, streamer=streamer)

Speculative sampling [Beta]
---------------------------

Transformers Neuron supports speculative sampling for the ``Llama`` and ``GPT2`` model classes. In speculative sampling, a smaller draft model speculates future tokens, which are then sent to the larger target model to be accepted or rejected. For more detailed information, see the original proposal by DeepMind titled `Accelerating Large Language Model Decoding with Speculative Sampling `__. Our implementation of speculative sampling is lossless.

In addition to standalone draft models, we also support `Eagle draft models `__. Currently we only support Eagle v1.

In the following example, we demonstrate how to perform speculative sampling with multinomial sampling using the ``Llama`` model:
.. code-block:: python

    import torch
    from transformers import LlamaTokenizer
    from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig, GenerationConfig
    from transformers_neuronx.fused_speculation import FusedSpeculativeDecoder

    # Specify the paths to the draft and target models
    draft = '/home/ubuntu/Llama-2-7b-chat-hf'
    target = '/home/ubuntu/Llama-2-70b-chat-hf'

    # Specify generation parameters
    gen_kwargs = {
        "top_k": 50,
        "top_p": 0.9,
        "do_sample": True,
        "temperature": 0.7,
    }

    # Load the draft model
    draft_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        draft,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
        ),
    )
    draft_neuron_model.to_neuron()

    # Load the target model
    target_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        target,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
        ),
    )
    target_neuron_model.to_neuron()

    # Compile the speculative sampling model
    # Here we set the speculation length to 4
    fsd = FusedSpeculativeDecoder(
        draft_neuron_model,
        target_neuron_model,
        4,
    )
    fsd.to_neuron()

    # Initialize the tokenizer and text prompt
    tokenizer = LlamaTokenizer.from_pretrained(target)
    prompt = "Hello, I'm a generative AI language model."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Call speculative sampling on the given input
    response = fsd.sample(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        sequence_length=30,
    )

    # Decode the response
    generated_text = tokenizer.decode(response[0])
    print(f"\nDecoded tokens: {generated_text}")

The following sample shows how to enable EAGLE speculation. To get the EAGLE draft model to work, manually copy the LM head weights from the target model to the draft model. Additionally, you need to rename the keys in the draft model's ``state_dict`` to match those in the target model:
.. code-block:: python

    import torch
    from transformers import LlamaTokenizer
    from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig, GenerationConfig
    from transformers_neuronx.fused_speculation import FusedSpeculativeDecoder

    # Specify the paths to the draft and target models
    # The Eagle draft model can be downloaded from the Eagle website
    draft = '/home/ubuntu/EAGLE-llama2-chat-70B'
    target = '/home/ubuntu/Llama-2-70b-chat-hf'

    # Specify generation parameters
    gen_kwargs = {
        "top_k": 50,
        "top_p": 0.9,
        "do_sample": True,
        "temperature": 0.7,
    }

    # Load the draft model
    draft_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        draft,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            is_eagle_draft=True,
            has_pre_attention_norm=False,
            # The above two configs are needed for Eagle
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
        ),
    )
    draft_neuron_model.to_neuron()

    # Load the target model
    target_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        target,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            is_eagle_target=True,
            # The above config is needed for Eagle
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
        ),
    )
    target_neuron_model.to_neuron()

    # Compile the speculative sampling model
    # Here we set the speculation length to 4
    fsd = FusedSpeculativeDecoder(
        draft_neuron_model,
        target_neuron_model,
        4,
    )
    fsd.to_neuron()

    # The rest of the flow is the same as the previous example

QKV Weight Fusion
-----------------

Concatenating a model's query, key and value weight matrices often achieves better performance because larger matrices allow for more efficient data movement and compute. QKV weight fusion can be enabled by setting ``fuse_qkv=True`` in the ``NeuronConfig``:

.. code-block:: python

    neuron_config = NeuronConfig(fuse_qkv=True)

Attention Layout
----------------

The intermediate tensor layouts in a model's attention layer affect the compiler's optimization opportunities and thus the model's performance. Using the ``(batch, sequence, hidden)`` (or ``BSH``) layout for attention often achieves better performance, since it enables better overlapping of compute with collectives and can reduce transposes. We intend to enable ``BSH`` attention by default in a future release. For now, the ``BSH`` attention layout can be enabled by setting ``attention_layout="BSH"`` in the ``NeuronConfig``:

.. code-block:: python

    neuron_config = NeuronConfig(attention_layout="BSH")

Bucketing
---------

LLM inference is a generative process that can produce variable-length sequences. This poses a problem, because the Neuron compiler produces executables that expect statically shaped inputs and outputs. To make LLMs work with different shapes, transformers-neuronx generates buckets and applies padding wherever it is required. There are at least two sets of buckets for each LLM inference that can be set by the user: 1) context encoding (prefill) buckets and 2) output token generation buckets.

**Token generation buckets**

In token generation, tokens are generated iteratively. At each token position, the transformer only needs to attend to the previous tokens. In a naive implementation with static shapes, however, attention would cover the entire KV cache (the full sequence length). To avoid this wasted compute, we use token generation buckets.
Token generation buckets determine the attention lengths. For instance, if the maximum sequence length is 1024 tokens and the current token is at position 120, there is no need to attend to all 1024 positions in the current step. We can use token generation buckets to attend to different portions of the KV cache. By default, token generation buckets are powers of 2 starting from 128 tokens (i.e. 128, 256, 512, up to the sequence length). In the example above, bucket 128 would be used for position 120, which significantly reduces the wasted compute. Users can change these buckets by setting a list for ``n_positions`` (see the example below). Otherwise, if a single number is given for ``n_positions`` (the sequence length) instead of a list, the powers-of-2 buckets starting from 128 are used. The last bucket is ``n_positions`` (the sequence length), even if it is not a power of 2.

**Context encoding buckets**

The prompt tokens can be processed in parallel. As a result, we need to set bucket sizes for the different estimated lengths of input prompts. We can specify these context bucket sizes using the ``context_length_estimate`` argument. In general, it is better for all buckets to be multiples of 256 tokens, but adding too many buckets increases device memory consumption and adds extra latency for bucket switching. Usually, powers of 2 starting from 128 tokens are used for context encoding buckets. If the total sequence length (``n_positions``) is beyond 2048 tokens, it is desirable to add extra buckets at multiples of 512 or 1024 tokens. It is not recommended to add buckets at multiples of 256 tokens or smaller for context buckets beyond 2k, to avoid bucket switching latency. At runtime, the smallest bucket which fits the input context is used. By default, the context encoding buckets are set to half of the output-token buckets. Adding extra context buckets reduces wasted compute and improves performance; however, the extra executables consume additional device memory.

Notice that the default output token generation buckets work well for a wide range of applications, while the ideal context encoding buckets depend on the specific use case. For instance, if all requests have a context length of about 1500 +/- 500 tokens, adding more buckets close to 1500 can reduce context encoding time. In this example, adding buckets of 1024, 1280, 1536, 1792 and 2048 tokens (a spacing of 256 tokens) could help. Moreover, the largest context encoding bucket should be larger than the largest context length; otherwise, performance degrades significantly. To set context encoding and token generation buckets manually:

.. code-block:: python

    context_length_estimate = [1024, 1280, 1536, 1792, 2048] # The best context estimate depends on the use case
    n_positions = [128, 256, 512, 1024, 2048, 3072] # Usually the default buckets are appropriate

    model = NeuronAutoModelForCausalLM.from_pretrained(
        'gpt2',
        batch_size=1,
        n_positions=n_positions,
        tp_degree=2,
        amp='f16',
        context_length_estimate=context_length_estimate,
    )

Multi-node inference support (TP/PP)
------------------------------------

Prerequisite: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup-trn1-multi-node-execution.html

When models are too large to fit on a single node, Transformers NeuronX multi-node inference (tensor parallel and pipeline parallel) can be used to shard model weights across multiple Neuron instances (only supported on Trn1 and Trn1n).
Single node inference code can easily be extended to multi-node inference. Note that Transformers NeuronX currently doesn't support multi-node tensor parallel and pipeline parallel at the same time; when pipeline parallel is used, tensor parallelism has to stay within a node (TP<=32 on Trn1/Trn1n). In the sections below, we first outline the sample code for single node execution and then provide instructions to migrate the code to multi-node tensor parallel or multi-node pipeline parallel.

To start with, the code below is a single node script, running the open_llama_3b model with a tensor parallel degree of 32:

.. code-block:: python

    import torch
    from transformers import AutoTokenizer, AutoConfig
    from transformers_neuronx import LlamaForSampling, HuggingFaceGenerationModelAdapter

    # Create and compile the Neuron model
    model = LlamaForSampling.from_pretrained("openlm-research/open_llama_3b", tp_degree=32)
    model.to_neuron()

    # Use the `HuggingFaceGenerationModelAdapter` to access the generate API
    config = AutoConfig.from_pretrained("openlm-research/open_llama_3b")
    model = HuggingFaceGenerationModelAdapter(config, model)

    # Get a tokenizer and example input
    tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = 'left'
    text = "Hello, I'm a language model,"
    encoded_input = tokenizer(text, return_tensors='pt', padding=True)

    # Run inference using temperature
    with torch.inference_mode():
        model.reset_generation()
        generated_sequence = model.generate(
            input_ids=encoded_input.input_ids,
            attention_mask=encoded_input.attention_mask,
            do_sample=True,
            max_length=256,
            temperature=0.7,
        )
    print([tokenizer.decode(tok) for tok in generated_sequence])

Command line:

.. code-block:: bash

    python3 multi_node_dev_example.py

**Multi-Node Tensor Parallel**

Compared to single node tensor parallel, multi-node tensor parallel shards the model weights in the same way, but across more cores spanning multiple nodes. It also requires that each node's ``model.forward()`` receives exactly the same input; otherwise there will be unexpected behavior (runtime failures, wrong output).

Configurations (environment variables to be configured on each node):

- ``NEURON_RT_ROOT_COMM_ID``: the master node's ``<ip>:<port>``
- ``NEURON_RANK_ID``: rank of the node, where 0 means the master node
- ``NEURON_LOCAL_TP``: the local tensor parallel degree on each node

Example: Change the single node script to use ``tp_degree=64`` (2 nodes). Set ``torch.manual_seed`` to ensure the sampling loop running on each node samples the same token as the next input.
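A minimal sketch of those two changes to the single node script above (``tp_degree=64`` assumes two Trn1 nodes with 32 NeuronCores each; the seed value is arbitrary but must match on every node):

.. code-block:: python

    import torch

    # A fixed seed keeps the per-node sampling loops in lockstep, so every
    # node feeds the same next token into model.forward()
    torch.manual_seed(0)

    # tp_degree=64 spans two nodes; NEURON_LOCAL_TP=32 keeps 32 cores per node
    model = LlamaForSampling.from_pretrained("openlm-research/open_llama_3b", tp_degree=64)
    model.to_neuron()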
Node 1 command line:

.. code-block:: bash

    NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=0 NEURON_LOCAL_TP=32 python3 multi_node_dev_example.py

Node 2 command line (same as Node 1 but with ``NEURON_RANK_ID`` set to 1):

.. code-block:: bash

    NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=1 NEURON_LOCAL_TP=32 python3 multi_node_dev_example.py

You can also refer to the `Tutorial `__ to run the Llama 3.1 405B multi-node 16k tutorial with multi-node tensor parallel.

**Multi-Node Pipeline Parallel**

While the weight tensors are sharded with tensor parallelism, pipeline parallelism can be used to partition the layers across different nodes; the intermediate (hidden) tensors are transferred from one pipeline stage (set of nodes) to the next, and the final output is sent from the last pipeline stage back to the first. Compared to multi-node tensor parallel, on non-zero ranks the ``model.forward`` in pipeline parallel falls back to a while loop and blocks on the input broadcast from the master node.

Configurations (environment variables to be configured on each node):

- ``NEURON_RT_ROOT_COMM_ID``: the master node's ``<ip>:<port>``
- ``CPU_COMM_ID``: similar to ``NEURON_RT_ROOT_COMM_ID``, but must be set to a different port
- ``NEURON_RANK_ID``: rank of the node, where 0 means the master node
- ``NEURON_PP_STAGES``: number of pipeline stages (nodes)

Example: Keep the original single node script with ``tp_degree=32``.

Node 1 command line:

.. code-block:: bash

    NEURON_PP_STAGES=2 CPU_COMM_ID=10.1.201.64:8989 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=0 python3 multi_node_dev_example.py

Node 2 command line (same as Node 1 but with ``NEURON_RANK_ID`` set to 1):

.. code-block:: bash

    NEURON_PP_STAGES=2 CPU_COMM_ID=10.1.201.64:8989 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=1 python3 multi_node_dev_example.py

Long sequence length support up to 128k
---------------------------------------

**Flash Attention**

With the integration of the FlashAttention kernel, developers can use longer sequence lengths for LLAMA models. The Flash Attention kernel is used automatically, without any additional configuration, when the input sequence length is greater than 8k. Refer to the `Tutorial `__ for usage of a 32k sequence length on a variation of the LLAMA3-8B model.

**Flash Decoding**

Flash Decoding (FD) is a technique that significantly speeds up attention during inference, especially for long-context tasks in large language models (LLMs) with GQA.

.. image:: ./flash_decoding.gif
   :alt: Flash Decoding
   :width: 800px
   :align: center

With the integration of FD, developers can achieve faster inference with larger sequence lengths and batch sizes by reducing KV cache replication. Refer to the `Tutorial `__ on flash decoding usage for 128k sequence length sampling. Flash decoding can be enabled by setting the flag ``shard_over_sequence=True`` in ``NeuronConfig``:

.. code-block:: python

    neuron_config = NeuronConfig(shard_over_sequence=True)

Note that you can skip the first AllGather introduced by flash decoding at the cost of duplicated Q weights. This is only recommended for relatively small models (i.e. 3B, 8B) and large batch sizes:

.. code-block:: python

    neuron_config = NeuronConfig(shard_over_sequence=True, duplicate_q_weight_sos=True)

**Known limitations and FAQs**

- Flash decoding is expected to have performance degradation (PTL) for smaller sequence lengths and batch sizes. We recommend flash decoding when **batch size x sequence length > 16k**
- Flash decoding support is not enabled for the following features:

  - Speculative Decoding
  - Multi Head Attention (MHA) models

================================================
FILE: archive/transformers-neuronx/transformers-neuronx-misc.rst
================================================

.. _transformers-neuronx-misc:

.. meta::
   :noindex:
   :nofollow:
   :description: This topic is currently archived and not maintained. It is provided for reference only.

Misc (``transformers-neuronx``)
===============================

================================================
FILE: archive/transformers-neuronx/transformers-neuronx-misc.txt
================================================

* :ref:`transformers-neuronx-rn`

================================================
FILE: archive/transformers-neuronx/transformers-neuronx-tutorials.rst
================================================

.. _transformers_neuronx_tutorials:
.. meta::
   :noindex:
   :nofollow:
   :description: This topic is currently archived and not maintained. It is provided for reference only.

Transformers NeuronX Tutorials
==============================

.. toctree::
   :maxdepth: 1
   :hidden:

   Hugging Face meta-llama/Llama-2-13b autoregressive sampling on Inf2 & Trn1
   Hugging Face facebook/opt-13b autoregressive sampling on Inf2 & Trn1
   Hugging Face facebook/opt-30b autoregressive sampling on Inf2 & Trn1
   Hugging Face facebook/opt-66b autoregressive sampling on Inf2

.. include:: /libraries/transformers-neuronx/transformers-neuronx-tutorials.txt

================================================
FILE: archive/transformers-neuronx/transformers-neuronx-tutorials.txt
================================================

* `Hugging Face meta-llama/Llama-2-13b autoregressive sampling on Inf2 & Trn1 `_
* `Hugging Face facebook/opt-13b autoregressive sampling on Inf2 & Trn1 `_
* `Hugging Face facebook/opt-30b autoregressive sampling on Inf2 & Trn1 `_
* `Hugging Face facebook/opt-66b autoregressive sampling on Inf2 `_

================================================
FILE: archive/transformers-neuronx/transformers-neuronx.txt
================================================

.. dropdown:: Setup (``transformers-neuronx``)
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /libraries/transformers-neuronx/setup/index.rst

.. dropdown:: Developer Guide (``transformers-neuronx``)
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /libraries/transformers-neuronx/developer-guide.txt

.. dropdown:: Tutorials (``transformers-neuronx``)
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /libraries/transformers-neuronx/transformers-neuronx-tutorials.txt

.. dropdown:: Misc (``transformers-neuronx``)
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /libraries/transformers-neuronx/transformers-neuronx-misc.txt

================================================
FILE: archive/tutorials/finetune_t5.rst
================================================

.. _torch-hf-t5-finetune:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Fine-tune T5 model on Trn1
==========================

.. note::

   This page was archived on 7/31/2025.

In this tutorial, we show how to fine-tune a Hugging Face (HF) T5 model using the HF trainer API. This example fine-tunes a `T5 model for a text-summarization `__ task on the CNN/DailyMail dataset.

.. contents:: Table of Contents
   :local:
   :depth: 2

.. include:: /frameworks/torch/torch-neuronx/tutorials/note-performance.txt

Setup and compilation
---------------------

Before running the tutorial, please follow the installation instructions at: :ref:`Install PyTorch Neuron on Trn1 `

Please set the storage of the instance to *512GB* or more if you also want to run through the BERT pretraining and GPT pretraining tutorials.

For all the commands below, make sure you are in the virtual environment that you have created above before you run the commands:

.. code:: shell

   source ~/aws_neuron_venv_pytorch/bin/activate

First we install a recent version of the HF transformers, scikit-learn and evaluate packages in our environment, and download the source matching the installed version.
In this example, we chose version 4.26.0 and the text summarization example from the HF transformers source:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_setup_code.sh
   :language: shell
   :lines: 5-9

Single-worker training
----------------------

We will run the text-summarization fine-tuning task following the example in README.md located in the path ``~/transformers/examples/pytorch/summarization``. We use full BF16 casting with ``XLA_USE_BF16=1`` to enable best performance.

First, paste the following script into your terminal to create a "run.sh" file and change it to executable:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_single_worker_training_code.sh
   :language: shell
   :lines: 7-46

We optionally precompile the model and training script using `neuron_parallel_compile `__ to warm up the persistent graph cache (Neuron Cache) such that the actual run has fewer compilations (faster run time):

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_single_worker_training_code.sh
   :language: shell
   :lines: 49

Note: For these auto-regressive models, do not run the ``predict_with_generate`` method during the precompile step. This is because the ``neuron_parallel_compile`` utility runs the training script in graph extraction mode, with no actual execution of the graph, so the outputs at each step are invalid. Since the auto-regressive generation at each step depends on the output of the previous step, the generate step would fail on these invalid outputs.

Precompilation is optional and only needs to be done once unless hyperparameters such as batch size are modified. After the optional precompilation, the actual run will be faster with minimal additional compilations.

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_single_worker_training_code.sh
   :language: shell
   :lines: 51

If precompilation was not done, the first execution of ./run.sh will be slower due to serial compilations. Rerunning the same script a second time shows quicker execution, as the compiled graphs are already stored in the persistent cache. Running the above script runs the T5-small fine-tuning on a single process.

**Note:** As you may have noticed, we are not running ``predict_with_generate`` as part of training. This is because ``predict_with_generate`` requires auto-regressive sampling, where the inputs to the decoder are created by appending outputs of previous steps. This causes the inputs to the decoder to change shape, resulting in a new graph at each step. In other words, the current ``generate`` API provided by HF transformers leads to repeated compilations. We are working on building a Neuron-friendly version of the ``generate`` API, which will be made available as part of a future release and will enable running ``predict_with_generate`` as part of the training script.

As a workaround, we can run ``predict_with_generate`` on CPU after the model is trained. Once training is completed, a trained checkpoint is saved; we can load the trained model and run ``predict_with_generate`` to compute the final accuracy. To do so, in run_summarization.py, add the following lines before ``transformers`` gets imported (i.e. before all the ``import`` statements):

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_single_worker_training_code.sh
   :language: python
   :lines: 55-59

You can now run the following, and it should run the predict method on the CPU device:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_single_worker_training_code.sh
   :language: shell
   :lines: 67-78

Note: To run on CPU, we need to make sure that NEURON_NUM_DEVICES is set to 0. This ensures that no XLA devices are created and the trainer uses the default device (CPU).
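As a sketch of that change (the exact lines ship with the tutorial source file included above), it amounts to setting the variable before anything else is imported:

.. code:: python

   # Set before all other imports so that no XLA devices are created
   # and the HF Trainer falls back to the default device (CPU)
   import os
   os.environ["NEURON_NUM_DEVICES"] = "0"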
.. _multi_worker_training:

Multi-worker Training
---------------------

The above script runs one worker on one NeuronCore. To run on multiple cores, first add these lines to the top of run_summarization.py to disable Distributed Data Parallel (DDP) when using torchrun (see the Known issues and limitations section below):

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_modify_run_summarization_code.sh
   :language: python
   :lines: 8-10

Then launch the run_summarization.py script with torchrun using the --nproc_per_node=N option to specify the number of workers (N=2 for trn1.2xlarge, and N=2, 8, or 32 for trn1.32xlarge). The following example runs 2 workers. Paste the following script into your terminal to create a "run_2w.sh" file and change it to executable:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_multi_worker_training_code.sh
   :language: shell
   :lines: 7-46

Again, we optionally precompile the model and training script using neuron_parallel_compile to warm up the persistent graph cache (Neuron Cache), ignoring the results from this precompile run as it is only for extracting and compiling the XLA graphs:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_multi_worker_training_code.sh
   :language: python
   :lines: 49

Precompilation is optional and only needs to be done once unless hyperparameters such as batch size are modified. After the optional precompilation, the actual run will be faster with minimal additional compilations.

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_multi_worker_training_code.sh
   :language: python
   :lines: 51

During the run, you will notice that the "Total train batch size" is now 8 and the "Total optimization steps" is now half the number for one-worker training. Also, if you open ``neuron-top`` in a separate terminal, you should see 2 cores being utilized.

To train the T5-large model, you can set the ``model_name_or_path`` argument to ``t5-large``. Please note that currently, running ``t5-large`` on a trn1-2xl machine can result in ``HOST OOM`` during compilation. Hence, it is recommended to run ``t5-large`` model training on a trn1-32xl machine.

On a trn1-32xl machine, you can create a run_32w.sh on the terminal using the following commands:

.. literalinclude:: tutorial_source_code/t5_finetuning/t5_finetuning_32_worker_training_code.sh
   :language: shell
   :lines: 7-46

You can now follow the same steps as listed above. This script runs a t5-large model training by launching 32 data-parallel workers.

.. _t5_known_issues:

Known issues and limitations
----------------------------

The following are currently known issues:

- Long compilation times: this can be alleviated with the ``neuron_parallel_compile`` tool, which extracts graphs from a short trial run and compiles them in parallel ahead of the actual run, as shown above.
- T5-Large compilation causing processes to get killed on trn1-2xl: it is recommended to run ``t5-large`` model training on a trn1-32xl machine, as this avoids CPU OOM and also provides faster training by making use of 32 data-parallel workers.
================================================
FILE: archive/tutorials/finetuning_llama2_7b_ptl.rst
================================================

.. _llama2_7b_tp_zero1_ptl_finetune_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Fine-tuning Llama2 7B with tensor parallelism and ZeRO-1 optimizer using Neuron PyTorch-Lightning
=================================================================================================

This tutorial shows how to fine-tune Llama2 7B with tensor parallelism and ZeRO-1 using Neuron PyTorch-Lightning APIs. For pre-training information and additional context, see the Llama2 7B Tutorial and :ref:`Neuron PT-Lightning Developer Guide `.

Setting up the environment
^^^^^^^^^^^^^^^^^^^^^^^^^^

For this experiment, we will use AWS ParallelCluster with at least four trn1.32xlarge compute nodes. To set up a cluster and prepare it for use, see `Train your model on ParallelCluster `__. To set up the packages on the head node of the cluster, see :ref:`Install PyTorch Neuron on Trn1 `.

Install the ``neuronx-distributed`` package inside the virtual environment using the following command:

.. code:: ipython3

   python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com

Next, download the scripts for fine-tuning.

1. Create a directory to hold the experiments.

.. code:: ipython3

   mkdir -p ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
   cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl

2. Download training scripts for the experiments.

.. code:: ipython3

   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lightning/data_module.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lightning/module_llama.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lightning/tp_zero1_llama2_7b_hf_finetune_ptl.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lightning/tp_zero1_llama2_7b_hf_finetune_ptl.sh
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lightning/finetune_config/config.json
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/lr.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/modeling_llama_nxd.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/requirements.txt
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/requirements_ptl.txt
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/training_utils.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/main/examples/training/llama/convert_checkpoints.py

3. Install the additional requirements and give the right permissions to the shell script.
.. code:: ipython3

   python3 -m pip install -r requirements.txt
   python3 -m pip install -r requirements_ptl.txt  # Currently we're supporting Lightning version 2.4.0
   python3 -m pip install optimum-neuron==0.0.18 nltk  # Additional dependencies for evaluation
   python3 -m pip install --no-warn-conflicts transformers==4.32.1  # Pin transformers version 4.32.1
   chmod +x tp_zero1_llama2_7b_hf_finetune_ptl.sh

Download the Llama2-7B pre-trained checkpoint from HuggingFace.

1. Create a Python script ``get_model.py`` with the following lines:

.. code:: ipython3

   import torch
   from transformers.models.llama.modeling_llama import LlamaForCausalLM

   model = LlamaForCausalLM.from_pretrained("NousResearch/Llama-2-7b-hf")
   torch.save(model.state_dict(), "llama-7b-hf-pretrained.pt")

2. Run the download script and the conversion script to pull and convert the checkpoint. Note that the conversion script requires a large amount of memory, so log in to a compute node to run it:

.. code:: ipython3

   ssh compute1-dy-training-0-1
   source ~/aws_neuron_venv_pytorch/bin/activate
   cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
   python3 get_model.py
   python3 convert_checkpoints.py --tp_size 8 --convert_from_full_model --config config.json --input_dir llama-7b-hf-pretrained.pt --output_dir llama7B-pretrained/pretrained_weight

3. (Optional) If you are loading the checkpoint from a different directory, set the checkpoint path by adding the following flag to ``tp_zero1_llama2_7b_hf_finetune_ptl.sh``:

* ``--pretrained_ckpt``. This provides the path to the pre-trained checkpoint to be loaded.

Then, set the dataset for the fine-tuning job. In this example, we will use Dolly, an open source dataset of instruction-following records on categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

.. code-block:: json

   {
       "instruction": "Alice's parents have three daughters: Amy, Jessy, and what's the name of the third daughter?",
       "context": "",
       "response": "The name of the third daughter is Alice"
   }

Configure the following flags in ``tp_zero1_llama2_7b_hf_finetune_ptl.sh``:

.. code:: ipython3

   --data_dir "databricks/databricks-dolly-15k" \
   --task "open_qa"

At this point, you are all set to start fine-tuning.

Running fine-tuning
^^^^^^^^^^^^^^^^^^^

By this step, the cluster is all set up for running experiments. Before running training, first pre-compile the graphs using :ref:`neuron_parallel_compile `. Run the command below:

.. code:: ipython3

   sbatch --exclusive \
   --nodes 1 \
   --wrap="srun neuron_parallel_compile bash $(pwd)/tp_zero1_llama2_7b_hf_finetune_ptl.sh"

This script uses a tensor-parallel size of 8, which automatically sets the ZeRO-1 sharding degree to 4 (32 workers / tensor_parallel_size).

Note: You can use any number of nodes in this case by adjusting the number of nodes in the above Slurm command accordingly. However, the number of nodes used in the parallel_compile command should be the same as the number used in the actual training run. This is because, as the number of nodes changes, the data-parallel degree changes too; more workers then participate in operations like gradient all-reduce, which results in new graphs getting created.

After the graphs are compiled, you can run training and observe how the loss goes down. Before starting the actual fine-tuning, prepare the dataset:

.. code:: ipython3

   python3 -c "import nltk; nltk.download('punkt')"

To run the training, run the above command without ``neuron_parallel_compile``:
.. code:: ipython3

   sbatch --exclusive \
   --nodes 1 \
   --wrap="srun bash $(pwd)/tp_zero1_llama2_7b_hf_finetune_ptl.sh"

At the end of fine-tuning, evaluation runs once on a test data split by generating sentences and calculating ROUGE scores. The final evaluation results and ROUGE score are then printed in your terminal.

Checkpointing
^^^^^^^^^^^^^

To enable checkpoint saving, add the following flags to ``tp_zero1_llama2_7b_hf_finetune_ptl.sh``:

* ``--save_checkpoint`` Enables checkpoint saving.
* ``--checkpoint_freq`` Number of steps between checkpoint saves.
* ``--checkpoint_dir`` Directory to save the checkpoint to.
* ``--num_kept_checkpoint`` Number of checkpoints to keep; older checkpoints beyond this number are deleted. Set to -1 to keep all saved checkpoints.
* ``--save_load_xser`` Saves and loads with torch_xla serialization to reduce save time. We recommend enabling xser for significantly faster save and load times. Note that if the checkpoint is saved with xser, it can only be loaded with xser, and vice versa.

To enable checkpoint loading, add the following flags to ``tp_zero1_llama2_7b_hf_finetune_ptl.sh``:

* ``--resume_ckpt`` Resumes from a saved checkpoint.
* ``--load_step`` The step to retrieve the checkpoint from.
* ``--checkpoint_dir`` Directory to load the checkpoint from.
* ``--save_load_xser`` Saves and loads with torch_xla serialization to reduce load time. We recommend enabling xser for significantly faster save and load times. Note that if the checkpoint is saved with xser, it can only be loaded with xser, and vice versa.

================================================
FILE: archive/tutorials/gpt3_neuronx_nemo_megatron_pretraining.rst
================================================

.. _gpt3_neuronx_nemo_megatron_pretraining:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Launch a GPT-3 pretraining job using neuronx-nemo-megatron
==========================================================

Archived tutorials for GPT-3 pretraining using neuronx-nemo-megatron:

* `Launch a GPT-3 23B pretraining job using neuronx-nemo-megatron `_
* `Launch a GPT-3 46B pretraining job using neuronx-nemo-megatron `_
* `Launch a GPT-3 175B pretraining job using neuronx-nemo-megatron `_

================================================
FILE: archive/tutorials/megatron_gpt_pretraining.rst
================================================

.. _megatron_gpt_pretraining:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Megatron GPT Pretraining
========================

.. note::

   This page was archived on 7/31/2025.

In this example, we will compile and train a Megatron GPT model on a single instance or on multiple instances using ParallelCluster with the NxD Training library. The example has the following main sections:

.. contents:: Table of contents
   :local:
   :depth: 2

Setting up the environment
--------------------------

ParallelCluster Setup
^^^^^^^^^^^^^^^^^^^^^

In this example, we will use 8 instances with ParallelCluster. Please follow the instructions here to create a cluster: `Train your model on ParallelCluster `_

ParallelCluster automates the creation of trn1 clusters and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS.
Install Dependencies
^^^^^^^^^^^^^^^^^^^^

Once you have launched a trn1 instance or ParallelCluster, please follow this guide on how to install the latest Neuron packages: `PyTorch Neuron Setup Guide `_.

Next, we will need to install NxD Training and its dependencies. Please see the following installation guide for installing NxD Training: :ref:`NxDT Installation Guide `

Download the dataset
--------------------

This tutorial makes use of a preprocessed Wikipedia dataset that is stored in S3. The dataset can be downloaded to your cluster or instance by running the following commands on the head node or your trn1 instance:

.. code-block:: bash

   export DATA_DIR=~/examples_datasets/gpt2
   mkdir -p ${DATA_DIR} && cd ${DATA_DIR}
   wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
   wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
   aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request
   aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request
   aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request

Pre-compile the model
---------------------

By default, PyTorch Neuron uses a just-in-time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). An alternative to the JIT flow is to use the included ``neuron_parallel_compile`` command to perform ahead-of-time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow.

First, clone the open-source ``neuronx-distributed-training`` library:

.. code:: ipython3

   git clone https://github.com/aws-neuron/neuronx-distributed-training
   cd neuronx-distributed-training/examples

Now, ensure that you are using the proper config file in the ``conf/`` directory. In the ``train.sh`` file, ensure that the ``CONF_FILE`` variable is properly set to the config for the model you want to use. In our case, it will be ``megatron_gpt_config``. The default config here is a 6.7B parameter model, but users can also add their own ``conf/*.yaml`` files and run different configs and hyperparameters if desired. Please see :ref:`Config Overview ` for examples and usage for the ``.yaml`` config files.

Next, run the following commands to launch an AOT pre-compilation job on your instance:

.. code-block:: bash

   export COMPILE=1
   ./train.sh

The compile output and logs will be shown directly in the terminal and you will see a message similar to this:

.. code-block:: bash

   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22
   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22
   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0

When you see this message, your compilation has completed successfully.

.. note:: The number of graphs will differ based on package versions, models, and other factors. This is just an example.
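Since the compiled graphs are cached locally, a quick way to sanity-check the result of precompilation is to look at the compiler cache itself. The path below is the usual default; it may differ if your setup overrides the cache location:

.. code-block:: bash

   # Inspect the local compiler cache (default path; can be overridden via NEURON_CC_FLAGS)
   ls /var/tmp/neuron-compile-cache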
If you are using ParallelCluster, then you will need to update the ``conf/megatron_gpt_config.yaml`` with:

.. code-block:: yaml

   num_nodes: 8

Then to run the compile job:

.. code-block:: bash

   export COMPILE=1
   sbatch --exclusive \
       --nodes 8 \
       --cpus-per-task 128 \
       --wrap="srun ./train.sh"

Once you have launched the precompilation job, run the ``squeue`` command to view the SLURM job queue on your cluster. If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, squeue should show output similar to the following:

.. code-block:: bash

   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   10 compute1 wrap ubuntu R 5:11 8 compute1-dy-queue1-i1-[0-7]

You can view the output of the precompilation job by examining the file named ``slurm-ZZ.out``, where ZZ represents the JOBID of your job in the squeue output above.

.. code-block:: bash

   tail -f slurm-10.out

Once the precompilation job is complete, you should see a message in the logs similar to the output shown above:

.. code-block:: bash

   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22
   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22
   2024-08-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0

At this point, you can press ``CTRL-C`` to exit the tail command.

Training the model
------------------

The pre-training job is launched in almost exactly the same way as the compile job. We now turn off the ``COMPILE`` environment variable and run the same training script to start pre-training.

On a single instance:

.. code-block:: bash

   export COMPILE=0
   ./train.sh

If you are using ParallelCluster:

.. code-block:: bash

   export COMPILE=0
   sbatch --exclusive \
       --nodes 8 \
       --cpus-per-task 128 \
       --wrap="srun ./train.sh"

As outlined above, you can again use the ``squeue`` command to view the job queue, and also monitor the job in the same way with the ``tail`` command to see the training logs. Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress:

Example:

.. code-block:: bash

   Epoch 0: 0%| | 189/301501 [59:12<1573:03:24, 18.79s/it, loss=7.75, v_num=3-16, reduced_train_loss=7.560, global_step=188.0, consumed_samples=24064.0]
   Epoch 0: 0%| | 190/301501 [59:30<1572:41:13, 18.79s/it, loss=7.74, v_num=3-16, reduced_train_loss=7.560, global_step=189.0, consumed_samples=24192.0]
   Epoch 0: 0%| | 191/301501 [59:48<1572:21:28, 18.79s/it, loss=7.73, v_num=3-16, reduced_train_loss=7.910, global_step=190.0, consumed_samples=24320.0]
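If you prefer a quick text-based check over scanning the full log, you can pull the reported loss values straight out of the Slurm output file. The job ID in the file name below is illustrative:

.. code-block:: bash

   # Print the five most recent reduced training loss values from the job log
   grep -o 'reduced_train_loss=[0-9.]*' slurm-10.out | tail -n 5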
Monitoring Training
-------------------

Tensorboard monitoring
^^^^^^^^^^^^^^^^^^^^^^

In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under ``~/neuronx-distributed-training/examples/nemo_experiments/megatron_gpt/``. Once you have identified the directory, cd into it, and then launch TensorBoard:

.. code-block:: bash

   cd ~/neuronx-distributed-training/examples/nemo_experiments/megatron_gpt/
   tensorboard --logdir ./

With TensorBoard running, you can then view the TensorBoard dashboard by browsing to ``http://localhost:6006`` on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node:

.. code-block:: bash

   ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006

neuron-top / neuron-monitor / neuron-ls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `neuron-top `_ tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node (if using ParallelCluster), and then run ``neuron-top``:

.. code-block:: bash

   ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command
   neuron-top

Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as `neuron-monitor `_ and `neuron-ls `_ to capture performance and utilization statistics and to understand NeuronCore allocation.

Troubleshooting Guide
---------------------

For issues with NxD Training, please see: :ref:`NxD Training Known Issues `

For ParallelCluster issues see: `AWS ParallelCluster Troubleshooting `_

================================================
FILE: archive/tutorials/multinode-training-model-profiling.rst
================================================

.. meta::
   :description: Learn how to use Neuron Explorer to analyze performance during multi-node training on AWS Trainium instances with SLURM job scheduling
   :date-modified: 12/02/2025

Profiling Multi-Node Training Jobs with Neuron Explorer
========================================================

This tutorial demonstrates how to use Neuron Explorer to analyze performance during multi-node training on AWS Trainium instances. We will run a scaled-down version of the :doc:`NxD Training Llama3 8B tutorial ` across 2 nodes, capture performance traces, and visualize them using Perfetto. We will run training with reduced steps and layers so that compilation and profiling complete quickly.

Prerequisites
-------------

* Access to a multi-node Trainium cluster (2 nodes in this example)
* Neuron SDK installed and configured along with :doc:`NxD Training library installation `
* Review of the :doc:`NxD Training Llama3 8B tutorial `
* Familiarity with SLURM job scheduling

Setup and Configuration
-----------------------

Step 1: Initial Setup
~~~~~~~~~~~~~~~~~~~~~~

A. Download the dataset script:

.. code-block:: bash

   # Download get_dataset.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/get_dataset.py

B. Create a directory for the dataset and get the corresponding config file:

.. code-block:: bash

   mkdir ~/examples_datasets/ && cd ~/examples_datasets/
   # Download config.json
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json ~/

C. Get the tokenizer using the following code snippet:

.. code-block:: python

   # tokenizer.py
   from huggingface_hub import login
   from transformers import AutoTokenizer

   login(token='YourHuggingFaceToken')
   tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
   tokenizer.save_pretrained(".")

.. code-block:: bash

   python3 tokenizer.py

D. Run ``get_dataset.py``:

.. code-block:: bash

   python3 ~/get_dataset.py --llama-version 3

E. Clone the neuronx-distributed-training git repo:
.. code-block:: bash

   cd ~
   git clone https://github.com/aws-neuron/neuronx-distributed-training.git
   cd ~/neuronx-distributed-training/examples

Step 2: Modify the Configuration Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update the training configuration to minimize runtime while still generating useful profiling data:

1. In ``hf_llama3_8B_config.yaml``, make the following changes:

.. code-block:: yaml

   max_steps: 5 # Run only 5 steps for faster turnaround
   num_layers: 2 # Reduce model depth to 2 layers
   num_nodes: 2 # Run only 2 nodes
   global_batch_size: 32 # Set a relatively smaller GBS to avoid large trace volume

These changes ensure the job compiles and runs quickly while still exercising the profiler.

2. In ``train.sh``, set the configuration file name:

.. code-block:: bash

   CONF_FILE=hf_llama3_8B_config

This ensures the job runs with your modified config.

Step 3: Compile the Model
~~~~~~~~~~~~~~~~~~~~~~~~~

Before training, the model must be compiled into Neuron Executable Files (NEFFs). To do this:

.. code-block:: bash

   export COMPILE=1
   export CONF_FILE=hf_llama3_8B_config
   sbatch --exclusive \
       --nodes=2 \
       --cpus-per-task=128 \
       --wrap="srun ./train.sh"

* ``COMPILE=1`` tells the script to run in compile-only mode.
* ``--nodes=2`` requests 2 Trainium nodes for compilation.
* ``srun ./train.sh`` launches the job via Slurm across the allocated nodes.

.. note:: The first compilation may take some time depending on the model size. Once compiled, NEFFs are cached for reuse in later training runs.

Step 4: Run the Training Job with Profiling Enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that compilation is done, we can run the training job while enabling Neuron Explorer:

.. code-block:: bash

   export COMPILE=0
   export CONF_FILE=hf_llama3_8B_config
   NEURON_RT_INSPECT_DEVICE_PROFILE=1 NEURON_RT_INSPECT_ENABLE=1 \
   NEURON_RT_INSPECT_OUTPUT_DIR=./output \
   sbatch --exclusive \
       --nodes=2 \
       --cpus-per-task=128 \
       --wrap="srun ./train.sh"

Here's what's happening:

* ``COMPILE=0``: Use precompiled NEFFs instead of recompiling.
* ``NEURON_RT_INSPECT_ENABLE=1``: Turns on runtime inspection for profiling.
* ``NEURON_RT_INSPECT_DEVICE_PROFILE=1``: Also captures device-level profiles from the NeuronCores.
* ``NEURON_RT_INSPECT_OUTPUT_DIR=./output``: All profiler logs will be saved into the ``./output`` directory.
* Slurm runs the job across 2 nodes with 128 CPUs per task.

At the end of this step, you should see an output directory containing runtime inspection logs from each node.

Step 5: Generate a Perfetto Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neuron Explorer produces raw trace data. To visualize it, convert the logs into a Perfetto-compatible trace file:

1. Run the Neuron Explorer CLI:

.. code-block:: bash

   neuron-profile view -d ./output --output-format perfetto

This command consolidates the logs and generates a Perfetto-compatible trace file.

Step 6: Visualize in Perfetto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Download the generated trace file to your local machine (see the sketch after this list).
2. Open the Perfetto UI.
3. Drag and drop the trace file into the browser window.

You'll now see a timeline view of your training job, including kernel execution, operator scheduling, and activity across NeuronCores. This visualization helps you identify compute vs. memory bottlenecks, idle time, and overall efficiency of the training job.
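As a sketch of the download step, you might copy the trace from the cluster head node with ``scp``. The key file, host address, and trace file name below are placeholders; the exact file name produced by ``neuron-profile view`` may differ on your system:

.. code-block:: bash

   # Run from your local machine; key, host, and trace path are placeholders
   scp -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS:~/neuronx-distributed-training/examples/output/trace.perfetto-trace .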
Step 7: Understanding the System Level Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the profile is loaded in Perfetto, you'll see both nodes (2 in our case) along with their workers, listed on the left-hand side as process IDs (PIDs). Each worker captures the same trace, so expanding any one of them will give you the information you need.

The key runtime event to focus on is the Neuron Runtime API call named ``nc_exec_running``. This API is responsible for executing a Neuron Executable File (NEFF) on the NeuronCores. If you hover over or click on one of these calls, Perfetto will display details about which NEFF is being executed. While you may see other runtime API calls, our primary interest is in ``nc_exec_running`` since it directly represents the model execution on Neuron hardware.

.. image:: /tools/profiler/images/multinode-training-1.png

In the example trace shown, the calls to ``nc_exec_running`` appear back-to-back with no significant delays in between. This indicates that, at a system level, the runtime is efficiently dispatching work to NeuronCores. The ``model_name`` field in the arguments section will display the name of the NEFF being used in the corresponding ``nc_exec_running``.

Step 8: Linking to Device-Level Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since the NEFF name is visible from the ``nc_exec_running`` API call, we can now visualize the profile for that NEFF, that is, how the model performs on a given NeuronCore. For this, on your Trainium cluster, navigate to your compile cache directory (if you are following this tutorial, it is set as ``compiler_cache_url`` in the config.yaml file). Search for the respective module directory based on the name, and you will see artifacts in that directory as shown below:

.. code-block:: text

   ├── compile_flags.json
   ├── model.done
   ├── model.hlo_module.pb
   └── model.neff

================================================
FILE: archive/tutorials/nxd-source-code/gpt_neox_tp_zero1/gpt_neox_20b.sh
================================================

#!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/adamw_fp32_optim_params.py ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/get_dataset.py ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/requirements.txt ./ python3 -m pip install -r requirements.txt python3 get_dataset.py PATH=$PATH:/opt/slurm/bin/ sbatch --exclusive \ --nodes 4 \ --cpus-per-task 128 \ --wrap="srun neuron_parallel_compile bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh" sbatch --exclusive \ --nodes 4 \ --cpus-per-task 128 \ --wrap="srun bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"

================================================
FILE: archive/tutorials/nxd-source-code/gpt_neox_tp_zero1/gpt_neox_6_9b.sh
================================================

#!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_6.9b_hf_pretrain/ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/adamw_fp32_optim_params.py ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/get_dataset.py ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/requirements.txt ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py ./ ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/utils.py ./ python3 -m pip install
-r requirements.txt python3 get_dataset.py PATH=$PATH:/opt/slurm/bin/ sbatch --exclusive \ --nodes 4 \ --wrap="srun neuron_parallel_compile bash $(pwd)/tp_dp_gpt_neox_6.9b_hf_pretrain.sh" sbatch --exclusive \ --nodes 4 \ --wrap="srun bash $(pwd)/tp_dp_gpt_neox_6.9b_hf_pretrain.sh" ================================================ FILE: archive/tutorials/nxd-source-code/llama_tp_pp_ptl/llama_2_13b.sh ================================================ #!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/llama/lightning chmod +x run_llama_13b_tp_pp_ptl.sh mkdir 13B_config cp ~/neuronx-distributed/examples/training/llama/tp_pp_llama_hf_pretrain/13B_config_llama2/config.json ./13B_config sudo rm -rf /home/ubuntu/.cache/ pip install --upgrade filelock python3 get_dataset.py --llama-version 2 PATH=$PATH:/opt/slurm/bin/ sbatch --exclusive \ --nodes 32 \ --cpus-per-task 128 \ --wrap="srun neuron_parallel_compile bash $(pwd)/run_llama_13b_tp_pp_ptl.sh" sbatch --exclusive \ --nodes 32 \ --cpus-per-task 128 \ --wrap="srun bash $(pwd)/run_llama_13b_tp_pp_ptl.sh" ================================================ FILE: archive/tutorials/nxd-source-code/llama_tp_pp_ptl/llama_2_70b.sh ================================================ #!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/llama/lightning chmod +x run_llama_70b_tp_pp_ptl.sh mkdir 70B_config cp ~/neuronx-distributed/examples/training/llama/tp_pp_llama_hf_pretrain/70B_config_llama2/config.json ./70B_config sudo rm -rf /home/ubuntu/.cache/ pip install --upgrade filelock python3 get_dataset.py --llama-version 2 PATH=$PATH:/opt/slurm/bin/ sbatch --exclusive \ --nodes 32 \ --cpus-per-task 128 \ --wrap="srun neuron_parallel_compile bash $(pwd)/run_llama_70b_tp_pp_ptl.sh" sbatch --exclusive \ --nodes 32 \ --cpus-per-task 128 \ --wrap="srun bash $(pwd)/run_llama_70b_tp_pp_ptl.sh" ================================================ FILE: archive/tutorials/nxd-source-code/llama_tp_pp_ptl/llama_2_7b.sh ================================================ #!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain chmod +x tp_zero1_llama2_7B_hf_pretrain.sh ln -sf 7B_config_llama2/config.json ./ sudo rm -rf /home/ubuntu/.cache/ pip install --upgrade filelock python3 get_dataset.py --llama-version 2 PATH=$PATH:/opt/slurm/bin/ sbatch --exclusive \ --nodes 4 \ --cpus-per-task 128 \ --wrap="srun neuron_parallel_compile bash $(pwd)/tp_zero1_llama2_7B_hf_pretrain.sh" sbatch --exclusive \ --nodes 4 \ --cpus-per-task 128 \ --wrap="srun bash $(pwd)/tp_zero1_llama2_7B_hf_pretrain.sh" ================================================ FILE: archive/tutorials/nxd-source-code/llama_tp_pp_ptl/llama_tp_pp_ptl_setup.sh ================================================ #!/bin/bash set -eExuo cd ~/neuronx-distributed/examples/training/llama/lightning ln -sf ~/neuronx-distributed/examples/training/llama/get_dataset.py ./ ln -sf ~/neuronx-distributed/examples/training/llama/lr.py ./ ln -sf ~/neuronx-distributed/examples/training/llama/modeling_llama_nxd.py ./ ln -sf ~/neuronx-distributed/examples/training/llama/requirements.txt ./ ln -sf ~/neuronx-distributed/examples/training/llama/requirements_ptl.txt ./ ln -sf ~/neuronx-distributed/examples/training/llama/training_utils.py ./ python3 -m pip install -r requirements.txt python3 -m pip install -r requirements_ptl.txt # Currently we're supporting Lightning version 2.1.0 ================================================ FILE: archive/tutorials/ssd300_demo/requirements.txt 
================================================
numpy>1.18.5
tensorflow_neuron==1.15.5.2.8.9.0
neuron_cc==1.13.5.0
tensorflow-serving-api==1.15.0
torch>=1.0,<2.0
torchvision<1.0
matplotlib<4.0
Cython<0.29
pycocotools==2.0.1

================================================
FILE: archive/tutorials/ssd300_demo/ssd300_demo.rst
================================================

.. _tensorflow-ssd300:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Running SSD300 with AWS Neuron
==============================

.. note:: This page was archived on 7/31/2025.

*Update 11/16: The model checkpoint link*\ https://api.ngc.nvidia.com/v2/models/nvidia/ssdpyt_fp32/versions/1/files/nvidia_ssdpyt_fp32_20190225.pt\ *is currently broken and the AWS Neuron team is working on providing an alternative source.*

This demo shows a Neuron compatible SSD300 implementation that is functionally equivalent to the open source SSD300 model. This demo uses TensorFlow-Neuron and the PyTorch SSD300 model and checkpoint (https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/), and also shows the performance achieved by the Inf1 instance.

Table of Contents
-----------------

1. Launch EC2 instance and update AWS Neuron SDK software
2. Generating Neuron compatible SSD300 TensorFlow SavedModel - Convert open source PyTorch SSD300 model and checkpoint into Neuron compatible SSD300 TensorFlow SavedModel
3. Evaluate the generated SSD300 TensorFlow SavedModel for both accuracy and performance - Running threaded inference through the COCO 2017 validation dataset

Launch EC2 instances and update tensorflow-neuron and neuron-cc
---------------------------------------------------------------

For this demo, launch one inf1.xlarge EC2 instance. We recommend using the latest Ubuntu 18 Deep Learning AMI (DLAMI). Please configure your ubuntu16/ubuntu18/yum repo following the steps in :ref:`install-neuron-tensorflow` in order to install ``tensorflow-model-server-neuron``.

Generating Neuron compatible SSD300 TensorFlow SavedModel
---------------------------------------------------------

First, connect to your inf1.xlarge instance.

Compile open source PyTorch SSD300 model and checkpoint into Neuron compatible SSD300 TensorFlow SavedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the same directory ssd300_demo, run the following:

1. Create venv and install dependencies

.. code:: bash

   sudo apt update
   sudo apt install g++ python3-dev python3-venv unzip
   sudo apt install tensorflow-model-server-neuron
   python3 -m venv env
   source ./env/bin/activate
   pip install pip setuptools --upgrade
   pip install -r ./requirements.txt --extra-index-url=https://pip.repos.neuron.amazonaws.com

2. Clone NVIDIA's DeepLearningExamples repo that contains PyTorch SSD300.

.. code:: bash

   git clone https://github.com/NVIDIA/DeepLearningExamples.git
   cd DeepLearningExamples
   git checkout a644350589f9abc91b203f73e686a50f5d6f3e96
   cd ..

3. Download PyTorch SSD300 checkpoint file.

.. code:: bash

   curl -LO https://api.ngc.nvidia.com/v2/models/nvidia/ssdpyt_fp32/versions/1/files/nvidia_ssdpyt_fp32_20190225.pt

4. Download COCO 2017 validation set and annotations.

.. code:: bash

   curl -LO http://images.cocodataset.org/zips/val2017.zip
   unzip ./val2017.zip
   curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip
   unzip ./annotations_trainval2017.zip
5. Convert PyTorch SSD300 model and checkpoint into a Neuron-compatible TensorFlow SavedModel.

.. code:: bash

   python ssd300_model.py --torch_checkpoint=./nvidia_ssdpyt_fp32_20190225.pt --output_saved_model=./ssd300_tf_neuron/1

This converts the PyTorch SSD300 model and checkpoint to a Neuron-compatible TensorFlow SavedModel using tensorflow-neuron and neuron-cc. The compilation output is stored in ``./ssd300_tf_neuron``.

6. Launch the ``tensorflow-model-server-neuron`` gRPC server at default port 8500 in the background.

.. code:: bash

   tensorflow_model_server_neuron --model_base_path=$(pwd)/ssd300_tf_neuron &

7. In the client, evaluate the Neuron-compatible TensorFlow SavedModel for both accuracy and performance. Note that this client by default assumes a ``tensorflow-model-server-neuron`` server listening at ``localhost:8500``. On inf1.xlarge, the expected throughput is 100 images/second once the server is fully warmed up, and the expected mean average precision (mAP) is 0.253.

.. code:: bash

   python ssd300_evaluation_client.py --val2017=./val2017 --instances_val2017_json=./annotations/instances_val2017.json

8. After running the demo, please clean up resources allocated in the Neuron runtime by gracefully killing the ``tensorflow_model_server_neuron`` process, e.g.,

.. code:: bash

   killall tensorflow_model_server_neuron

================================================
FILE: archive/tutorials/ssd300_demo/ssd300_detection.py
================================================

import argparse import json import pkg_resources from distutils.version import LooseVersion import numpy as np from PIL import Image import matplotlib.pyplot as plt import matplotlib.patches as patches import tensorflow as tf import tensorflow.neuron as tfn def main(): parser = argparse.ArgumentParser() parser.add_argument('--image', required=True, help='Path to image that is to be detected. Support jpeg and png format.') parser.add_argument('--image_with_detections', required=True, help='Path to save image after detection (with bounding boxes drawn). Png format.') parser.add_argument('--saved_model', required=True, help='TensorFlow SSD300 SavedModel') parser.add_argument('--score_threshold', type=float, default=0.15, help='Minimum required score for drawing a bounding box') parser.add_argument('--instances_val2017_json', default=None, help='Json file that contains labeling information') parser.add_argument('--save_results', default=None) parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if not args.disable_version_check: tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.0.1.0.1333.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. 
Please upgrade ' 'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) with open(args.image, 'rb') as f: img_jpg_bytes = f.read() model_feed_dict = {'batch_image': [img_jpg_bytes]} predictor = tf.contrib.predictor.from_saved_model(args.saved_model) results = predictor(model_feed_dict) if args.save_results is not None: np.savez(args.save_results, **results) boxes_np = results['boxes'] scores_np = results['scores'] classes_np = results['classes'] if args.instances_val2017_json is not None: with open(args.instances_val2017_json) as f: annotate_json = json.load(f) label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])} plt.switch_backend('agg') fig, ax = plt.subplots(1) ax.imshow(Image.open(args.image).convert('RGB')) wanted = scores_np[0] > args.score_threshold for xywh, label_no_bg in zip(boxes_np[0][wanted], classes_np[0][wanted]): rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none') ax.add_patch(rect) rx, ry = rect.get_xy() rx = rx + rect.get_width() / 2.0 if args.instances_val2017_json is not None: ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10, ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5)) plt.savefig(args.image_with_detections) plt.close(fig) if __name__ == '__main__': main() ================================================ FILE: archive/tutorials/ssd300_demo/ssd300_evaluation.py ================================================ import argparse import os import json import glob from concurrent import futures import time import pkg_resources from distutils.version import LooseVersion import numpy as np import tensorflow as tf import tensorflow.neuron as tfn from pycocotools.cocoeval import COCOeval from DeepLearningExamples.PyTorch.Detection.SSD.src.coco import COCO from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import dboxes300_coco from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import SSDTransformer from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import COCODetection def get_val_dataset(val_annotate, val_coco_root): dboxes = dboxes300_coco() val_trans = SSDTransformer(dboxes, (300, 300), val=True) val_coco = COCODetection(val_coco_root, val_annotate, val_trans) return val_coco def main(): parser = argparse.ArgumentParser() parser.add_argument('--saved_model', required=True, help='TensorFlow SSD300 SavedModel') parser.add_argument('--val2017', required=True, help='Path to COCO 2017 validation dataset') parser.add_argument('--instances_val2017_json', required=True, help='Json file that contains labeling information') parser.add_argument('--num_sessions', type=int, default=1, help='Number of tensorflow sessions') parser.add_argument('--num_threads', type=int, default=4, help='Number of threads') parser.add_argument('--throughput_interval', type=int, default=10, help='Interval for counting throughput') parser.add_argument('--save_results', default=None) parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if not args.disable_version_check: tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.0.1.0.1333.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. 
Please upgrade ' 'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) predictor_list = [tf.contrib.predictor.from_saved_model(args.saved_model) for _ in range(args.num_sessions)] val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017) inv_map = {v: k for k, v in val_dataset.label_map.items()} model_feed_dict_list = [] for img_id in val_dataset.img_keys: img_path = os.path.join(args.val2017, val_dataset.images[img_id][0]) with open(img_path, 'rb') as f: img_jpg_bytes = f.read() model_feed_dict_list.append({'batch_image': [img_jpg_bytes]}) latency_list = [] throughput_list = [] def predict(pred, model_feed_dict): start = time.time() result = pred(model_feed_dict) latency_list.append(time.time() - start) return result def performance(): last_num_infer = len(latency_list) while len(latency_list) < len(model_feed_dict_list): current_num_infer = len(latency_list) throughput = (current_num_infer - last_num_infer) / args.throughput_interval throughput_list.append(throughput) p50 = 0.0 p90 = 0.0 if latency_list: p50 = np.percentile(latency_list, 50) p90 = np.percentile(latency_list, 90) print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90)) last_num_infer = current_num_infer time.sleep(args.throughput_interval) executor = futures.ThreadPoolExecutor(max_workers=(args.num_sessions*args.num_threads)+1) performance_future = executor.submit(performance) eval_futures = [] for idx, model_feed_dict in enumerate(model_feed_dict_list): eval_fut = executor.submit(predict, predictor_list[idx%len(predictor_list)], model_feed_dict) eval_futures.append(eval_fut) waited_results = [] for idx, eval_fut in enumerate(eval_futures): if idx % 100 == 0: print('evaluating image {}/{}'.format(idx, len(eval_futures))) waited_results.append(eval_fut.result()) eval_results = [] for idx, (img_id, results) in enumerate(zip(val_dataset.img_keys, waited_results)): boxes = results['boxes'] for box, label, prob in zip(results['boxes'][0], results['classes'][0], results['scores'][0]): res = [img_id, box[0], box[1], box[2], box[3], prob, inv_map[label+1]] # +1 to account for background eval_results.append(res) performance_future.result() coco_gt = COCO(annotation_file=args.instances_val2017_json) coco_dt = coco_gt.loadRes(np.array(eval_results).astype(np.float32)) coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox') coco_eval.evaluate() coco_eval.accumulate() coco_eval.summarize() if args.save_results is not None: np.save(args.save_results, coco_eval.stats) if __name__ == '__main__': main() ================================================ FILE: archive/tutorials/ssd300_demo/ssd300_evaluation_client.py ================================================ import argparse import os import json import glob from concurrent import futures import time import subprocess from distutils.version import LooseVersion import numpy as np import tensorflow as tf import grpc from tensorflow_serving.apis import predict_pb2 from tensorflow_serving.apis import prediction_service_pb2_grpc from pycocotools.cocoeval import COCOeval from DeepLearningExamples.PyTorch.Detection.SSD.src.coco import COCO from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import dboxes300_coco from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import SSDTransformer from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import COCODetection def get_val_dataset(val_annotate, val_coco_root): dboxes = dboxes300_coco() val_trans = 
SSDTransformer(dboxes, (300, 300), val=True) val_coco = COCODetection(val_coco_root, val_annotate, val_trans) return val_coco def main(): parser = argparse.ArgumentParser() parser.add_argument('--server_address', default='localhost:8500', help='tensorflow-model-server-neuron grpc address') parser.add_argument('--model_name', default='default', help='Serving model name') parser.add_argument('--val2017', required=True, help='Path to COCO 2017 validation dataset') parser.add_argument('--instances_val2017_json', required=True, help='Json file that contains labeling information') parser.add_argument('--num_threads', type=int, default=4, help='Number of threads') parser.add_argument('--throughput_interval', type=int, default=10, help='Interval for counting throughput') parser.add_argument('--save_results', default=None) args = parser.parse_args() channel = grpc.insecure_channel(args.server_address) stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017) inv_map = {v: k for k, v in val_dataset.label_map.items()} request_list = [] for img_id in val_dataset.img_keys: img_path = os.path.join(args.val2017, val_dataset.images[img_id][0]) with open(img_path, 'rb') as f: img_jpg_bytes = f.read() data = np.array([img_jpg_bytes], dtype=object) data = tf.contrib.util.make_tensor_proto(data, shape=data.shape) request = predict_pb2.PredictRequest() request.model_spec.name = args.model_name request.inputs['batch_image'].CopyFrom(data) request_list.append(request) latency_list = [] throughput_list = [] def predict(request): start = time.time() result = stub.Predict(request).outputs latency_list.append(time.time() - start) return result def performance(): last_num_infer = len(latency_list) while len(latency_list) < len(request_list): current_num_infer = len(latency_list) throughput = (current_num_infer - last_num_infer) / args.throughput_interval throughput_list.append(throughput) p50 = 0.0 p90 = 0.0 if latency_list: p50 = np.percentile(latency_list, 50) p90 = np.percentile(latency_list, 90) print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90)) last_num_infer = current_num_infer time.sleep(args.throughput_interval) executor = futures.ThreadPoolExecutor(max_workers=args.num_threads+1) performance_future = executor.submit(performance) eval_futures = [] for idx, request in enumerate(request_list): eval_fut = executor.submit(predict, request) eval_futures.append(eval_fut) waited_results = [] for idx, eval_fut in enumerate(eval_futures): if idx % 100 == 0: print('evaluating image {}/{}'.format(idx, len(eval_futures))) waited_results.append(eval_fut.result()) eval_results = [] for idx, (img_id, results) in enumerate(zip(val_dataset.img_keys, waited_results)): results = {key: tf.make_ndarray(value) for key, value in results.items()} boxes = results['boxes'] for box, label, prob in zip(results['boxes'][0], results['classes'][0], results['scores'][0]): res = [img_id, box[0], box[1], box[2], box[3], prob, inv_map[label+1]] # +1 to account for background eval_results.append(res) performance_future.result() coco_gt = COCO(annotation_file=args.instances_val2017_json) coco_dt = coco_gt.loadRes(np.array(eval_results).astype(np.float32)) coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox') coco_eval.evaluate() coco_eval.accumulate() coco_eval.summarize() if args.save_results is not None: np.save(args.save_results, coco_eval.stats) if __name__ == '__main__': main() 
================================================ FILE: archive/tutorials/ssd300_demo/ssd300_model.py ================================================ import sys import os import argparse import time import itertools from functools import partial from collections import Counter import json import shutil import pkg_resources from distutils.version import LooseVersion import numpy as np import tensorflow as tf from tensorflow.core.framework import attr_value_pb2 import tensorflow.neuron as tfn import torch def decode_jpeg_resize(input_tensor, image_size): # decode jpeg tensor = tf.image.decode_png(input_tensor, channels=3) # resize decoded_shape = tf.shape(tensor) tensor = tf.cast(tensor, tf.float32) decoded_shape_hw = decoded_shape[0:2] decoded_shape_hw_float32 = tf.cast(decoded_shape_hw, tf.float32) tensor = tf.image.resize(tensor, image_size) # normalize tensor -= np.array([0.485, 0.456, 0.406]).astype(np.float32) * 255.0 return tensor, decoded_shape_hw_float32[::-1] def preprocessor(input_tensor, image_size): with tf.name_scope('Preprocessor'): tensor, bbox_scale_hw = tf.map_fn( partial(decode_jpeg_resize, image_size=image_size), input_tensor, dtype=(tf.float32, tf.float32), back_prop=False, parallel_iterations=16) return tensor, bbox_scale_hw def tf_Conv2d(input_tensor, module, first_conv=False): np_dtype = input_tensor.dtype.as_numpy_dtype kernel_np = module.weight.detach().numpy().transpose([2, 3, 1, 0]) if first_conv: kernel_np /= (np.array([0.229, 0.224, 0.225]).astype(np.float32) * 255.0)[:, np.newaxis] kernel = tf.constant(kernel_np.astype(np_dtype)) if any(module.padding): pad_h, pad_w = module.padding padding = [[0, 0], [pad_h, pad_h], [pad_w, pad_w], [0, 0]] input_tensor = tf.pad(input_tensor, padding) stride_h, stride_w = module.stride tensor = tf.nn.conv2d(input_tensor, kernel, strides=[1, stride_h, stride_w, 1], padding='VALID') if module.bias is not None: bias = tf.constant(module.bias.detach().numpy().astype(np_dtype)) tensor = tf.nn.bias_add(tensor, bias) return tensor def tf_BatchNorm2d(input_tensor, module): def _norm_np(ts): return ts.astype(input_tensor.dtype.as_numpy_dtype) mean = _norm_np(module.running_mean.detach().numpy()) offset = _norm_np(module.bias.detach().numpy()) inv_std = np.sqrt(module.running_var.detach().numpy() + module.eps) scale_inv_std = _norm_np(module.weight.detach().numpy() / inv_std) return scale_inv_std * (input_tensor - mean) + offset def tf_MaxPool2d(input_tensor, module): pad = module.padding tensor = tf.pad(input_tensor, [[0, 0], [pad, pad], [pad, pad], [0, 0]]) return tf.nn.max_pool2d(tensor, ksize=module.kernel_size, strides=module.stride, padding='VALID') def tf_Bottleneck(input_tensor, module): tensor = tf_Conv2d(input_tensor, module.conv1) tensor = tf_BatchNorm2d(tensor, module.bn1) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, module.conv2) tensor = tf_BatchNorm2d(tensor, module.bn2) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, module.conv3) tensor = tf_BatchNorm2d(tensor, module.bn3) if module.downsample is not None: input_tensor = tf_Conv2d(input_tensor, module.downsample[0]) input_tensor = tf_BatchNorm2d(input_tensor, module.downsample[1]) return tf.nn.relu(input_tensor + tensor) def tf_SequentialBottleneck(tensor, seq, resnet): with tf.name_scope('{}.Sequential'.format(seq)): for idx, module in enumerate(resnet[seq]): with tf.name_scope('{}.BasicBlock'.format(idx)): tensor = tf_Bottleneck(tensor, module) return tensor def tf_bbox_view(detection_feed, modules, ndim): results = [] for idx, (tensor, mod) in 
enumerate(zip(detection_feed, modules)): with tf.name_scope('branch{}'.format(idx)): tensor = tf_Conv2d(tensor, mod) tensor = tf.transpose(tensor, [0, 3, 1, 2]) tensor = tf.cast(tensor, tf.float32) shape = tensor.shape.as_list() batch_size = -1 if shape[0] is None else shape[0] new_shape = [batch_size, ndim, np.prod(shape[1:]) // ndim] results.append(tf.reshape(tensor, new_shape)) tensor = tf.concat(results, axis=-1) return tensor def tf_feature_extractor(input_tensor, resnet): with tf.name_scope('FeatureExtractor'): with tf.name_scope('0.Conv2d'): tensor = tf_Conv2d(input_tensor, resnet[0], first_conv=True) with tf.name_scope('1.BatchNorm2d'): tensor = tf_BatchNorm2d(tensor, resnet[1]) with tf.name_scope('2.ReLU'): tensor = tf.nn.relu(tensor) with tf.name_scope('3.MaxPool2d'): tensor = tf_MaxPool2d(tensor, resnet[3]) tensor = tf_SequentialBottleneck(tensor, 4, resnet) tensor = tf_SequentialBottleneck(tensor, 5, resnet) tensor = tf_SequentialBottleneck(tensor, 6, resnet) tensor = tf.cast(tensor, tf.float16) return tensor def tf_box_predictor(tensor, ssd300_torch): with tf.name_scope('BoxPredictor'): detection_feed = [tensor] for idx, block in enumerate(ssd300_torch.additional_blocks): with tf.name_scope('{}.Sequential'.format(idx)): tensor = tf_Conv2d(tensor, block[0]) tensor = tf_BatchNorm2d(tensor, block[1]) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, block[3]) tensor = tf_BatchNorm2d(tensor, block[4]) tensor = tf.nn.relu(tensor) detection_feed.append(tensor) with tf.name_scope('Boxes'): loc = tf_bbox_view(detection_feed, ssd300_torch.loc, ndim=4) with tf.name_scope('Probabilities'): conf = tf_bbox_view(detection_feed, ssd300_torch.conf, ndim=ssd300_torch.label_num) return loc, conf @tfn.fuse(batch_size=1, dynamic_batch_size=True) def tf_ssd300(input_tensor, ssd300_torch): with tf.name_scope('SSD300'): tensor = tf_feature_extractor(input_tensor, ssd300_torch.feature_extractor.feature_extractor) loc, conf = tf_box_predictor(tensor, ssd300_torch) return loc, conf def scale_back_batch(bboxes_in, scores_in, scale_xy, scale_wh, dboxes_xywh): """ Do scale and transform from xywh to ltrb suppose input Nx4xnum_bbox Nxlabel_numxnum_bbox """ with tf.name_scope('ScaleBackBatch'): bboxes_in = tf.transpose(bboxes_in, [0, 2, 1]) scores_in = tf.transpose(scores_in, [0, 2, 1]) bboxes_xy = bboxes_in[:, :, :2] bboxes_wh = bboxes_in[:, :, 2:] bboxes_xy *= scale_xy bboxes_wh *= scale_wh bboxes_xy = bboxes_xy * dboxes_xywh[:, :, 2:] + dboxes_xywh[:, :, :2] bboxes_wh = tf.exp(bboxes_wh) * dboxes_xywh[:, :, 2:] bboxes_wh_half = 0.5 * bboxes_wh bboxes_lt = bboxes_xy - bboxes_wh_half bboxes_rb = bboxes_xy + bboxes_wh_half bboxes_in = tf.concat([bboxes_lt, bboxes_rb], axis=-1) return bboxes_in, tf.nn.softmax(scores_in, axis=-1) def select_nms_outputs(input_tensors): boxes_xywh, scores, classes, valid_detections = input_tensors return boxes_xywh[:valid_detections], scores[:valid_detections], classes[:valid_detections] def postprocessor(ploc_ts, plabel_ts, bbox_scale_hw_ts, scale_xy, scale_wh, dboxes_xywh): with tf.name_scope('Postprocessor'): ploc_ts = tf.cast(ploc_ts, tf.float32) plabel_ts = tf.cast(plabel_ts, tf.float32) bboxes_ts, probs_ts = scale_back_batch(ploc_ts, plabel_ts, scale_xy, scale_wh, dboxes_xywh) bboxes_ts = bboxes_ts[:, :, tf.newaxis, :] probs_ts = probs_ts[:, :, 1:] nms_outputs = tf.image.combined_non_max_suppression( bboxes_ts, probs_ts, max_output_size_per_class=200, max_total_size=200, iou_threshold=0.5, score_threshold=0.05, pad_per_class=False, clip_boxes=False, 
name='CombinedNonMaxSuppression', ) nmsed_boxes_x0y0x1y1, nmsed_scores, nmsed_classes, valid_detections = nms_outputs nmsed_boxes_x0y0 = nmsed_boxes_x0y0x1y1[..., :2] nmsed_boxes_x1y1 = nmsed_boxes_x0y0x1y1[..., 2:] bbox_scale_hw_ts = bbox_scale_hw_ts[:, tf.newaxis, :] nmsed_boxes_xy = nmsed_boxes_x0y0 * bbox_scale_hw_ts nmsed_boxes_wh = (nmsed_boxes_x1y1 - nmsed_boxes_x0y0) * bbox_scale_hw_ts nmsed_boxes_xywh = tf.concat([nmsed_boxes_xy, nmsed_boxes_wh], axis=-1) nmsed_boxes_xywh, nmsed_scores, nmsed_classes = tf.map_fn( select_nms_outputs, (nmsed_boxes_xywh, nmsed_scores, nmsed_classes, valid_detections), dtype=(tf.float32, tf.float32, tf.float32), back_prop=False, parallel_iterations=16) return nmsed_boxes_xywh, nmsed_scores, nmsed_classes class DefaultBoxes(object): def __init__(self, fig_size, feat_size, steps, scales, aspect_ratios, scale_xy=0.1, scale_wh=0.2): self.feat_size = feat_size self.fig_size = fig_size self.scale_xy_ = scale_xy self.scale_wh_ = scale_wh # According to https://github.com/weiliu89/caffe # Calculation method slightly different from paper self.steps = steps self.scales = scales fk = fig_size/np.array(steps) self.aspect_ratios = aspect_ratios self.default_boxes = [] # size of feature and number of feature for idx, sfeat in enumerate(self.feat_size): sk1 = scales[idx]/fig_size sk2 = scales[idx+1]/fig_size sk3 = np.sqrt(sk1*sk2) all_sizes = [(sk1, sk1), (sk3, sk3)] for alpha in aspect_ratios[idx]: w, h = sk1*np.sqrt(alpha), sk1/np.sqrt(alpha) all_sizes.append((w, h)) all_sizes.append((h, w)) for w, h in all_sizes: for i, j in itertools.product(range(sfeat), repeat=2): cx, cy = (j+0.5)/fk[idx], (i+0.5)/fk[idx] self.default_boxes.append((cx, cy, w, h)) self.dboxes = np.array(self.default_boxes) self.dboxes = self.dboxes.clip(min=0, max=1) # For IoU calculation self.dboxes_ltrb = self.dboxes.copy() self.dboxes_ltrb[:, 0] = self.dboxes[:, 0] - 0.5 * self.dboxes[:, 2] self.dboxes_ltrb[:, 1] = self.dboxes[:, 1] - 0.5 * self.dboxes[:, 3] self.dboxes_ltrb[:, 2] = self.dboxes[:, 0] + 0.5 * self.dboxes[:, 2] self.dboxes_ltrb[:, 3] = self.dboxes[:, 1] + 0.5 * self.dboxes[:, 3] @property def scale_xy(self): return self.scale_xy_ @property def scale_wh(self): return self.scale_wh_ def __call__(self, order="ltrb"): if order == "ltrb": return self.dboxes_ltrb if order == "xywh": return self.dboxes def dboxes300_coco(): figsize = 300 feat_size = [38, 19, 10, 5, 3, 1] steps = [8, 16, 32, 64, 100, 300] # use the scales here: https://github.com/amdegroot/ssd.pytorch/blob/master/data/config.py scales = [21, 45, 99, 153, 207, 261, 315] aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]] dboxes = DefaultBoxes(figsize, feat_size, steps, scales, aspect_ratios) return dboxes def main(): parser = argparse.ArgumentParser() parser.add_argument('--torch_checkpoint', required=True, help='Path to PyTorch SSD300 model checkpoint') parser.add_argument('--output_saved_model', required=True, help='Output TensorFlow SavedModel that runs on Inferentia') parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if os.path.exists(args.output_saved_model): raise OSError('SavedModel dir {} already exists'.format(args.output_saved_model)) if not args.disable_version_check: neuroncc_version = LooseVersion(pkg_resources.get_distribution('neuron-cc').version) if neuroncc_version < LooseVersion('1.0.18000'): raise RuntimeError( 'neuron-cc version {} is too low for this demo. 
Please upgrade ' 'by "pip install -U neuron-cc --index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version)) tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.3.1.0.1900.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. Please upgrade ' 'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) sys.path.append(os.getcwd()) from DeepLearningExamples.PyTorch.Detection.SSD.src import model as torch_ssd300_model ssd300_torch = torch_ssd300_model.SSD300() ckpt = torch.load(args.torch_checkpoint, map_location=torch.device('cpu')) ssd300_torch.load_state_dict(ckpt['model']) ssd300_torch.eval() input_tensor = tf.placeholder(tf.string, [None]) image_tensor, bbox_scale_hw_tensor = preprocessor(input_tensor, [300, 300]) dboxes = dboxes300_coco() dboxes_xywh = dboxes(order="xywh")[np.newaxis, ...] ploc_tensor, plabel_tensor = tf_ssd300(image_tensor, ssd300_torch) boxes_tensor, scores_tensor, classes_tensor = postprocessor( ploc_tensor, plabel_tensor, bbox_scale_hw_tensor, dboxes.scale_xy, dboxes.scale_wh, dboxes_xywh) outputs = { 'boxes': boxes_tensor, 'scores': scores_tensor, 'classes': classes_tensor, } sess = tf.Session() try: sess.run(outputs) except: pass for op in sess.graph.get_operations(): if op.type == 'NeuronOp': if not op.get_attr('executable'): raise AttributeError( 'Neuron executable (neff) is empty. Please check neuron-cc is installed and working properly ' '("pip install neuron-cc --force --index-url=https://pip.repos.neuron.amazonaws.com" ' 'to force reinstall neuron-cc).') model_config = op.node_def.attr['model_config'].list if model_config.i: model_config.i[0] = 1 else: model_config.i.extend([1, 1, 1, 10]) op._set_attr('model_config', attr_value_pb2.AttrValue(list=model_config)) tf.saved_model.simple_save(sess, args.output_saved_model, {'batch_image': input_tensor}, outputs) if __name__ == '__main__': main()

================================================
FILE: archive/tutorials/training-gpt-neox-20b.rst
================================================

.. _gpt_neox_20b_tp_zero1_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently unsupported and not maintained. It is provided for reference only.

Training GPT-NeoX 20B with Tensor Parallelism and ZeRO-1 Optimizer
=========================================================================================

In this section, we showcase how to pretrain a GPT-NeoX 20B model by using the sequence parallel optimization of tensor parallelism in the ``neuronx-distributed`` package. Please refer to the `Neuron Samples repository `__ to view the files in this tutorial.

This GPT-NeoX 20B tutorial differs from the :ref:`GPT-NeoX 6.9B tutorial` in the following ways:

* sequence parallel optimization has been applied
* parallel cross entropy has been applied
* the model size has been increased from 6.9B to 20B
* the TP degree has been increased from 8 to 32

Setting up the environment is the same as in the :ref:`GPT-NeoX 6.9B tutorial`.

**Let’s download the scripts for pretraining:**

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_20b.sh
   :language: shell
   :lines: 4-8

Next, let’s download and pre-process the dataset:

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_20b.sh
   :language: shell
   :lines: 10

At this point, you are all set to start training.
**Running training**

We first pre-compile the graphs using ``neuron_parallel_compile``. Let’s run the command below:

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_20b.sh
   :language: shell
   :lines: 14-17

This script uses a tensor-parallel size of 32. This will automatically set the zero-1 sharding degree to 4 (4 * 32 workers / tensor_parallel_size).
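To make the arithmetic concrete, here is a quick sketch of the computation, assuming 32 workers per trn1.32xlarge node as in this tutorial:

.. code-block:: bash

   # 4 nodes x 32 workers = 128 workers; 128 / TP degree 32 = zero-1 sharding degree 4
   NODES=4; WORKERS_PER_NODE=32; TP_DEGREE=32
   echo $(( NODES * WORKERS_PER_NODE / TP_DEGREE ))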
Once the graphs are compiled, we can run training and observe the loss going down. To run the training, we run the above command, but without ``neuron_parallel_compile``.

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_20b.sh
   :language: shell
   :lines: 19-22

**Sequence Parallel**

We made the following model-level modifications to enable sequence parallelism:

* turn on ``sequence_parallel_enabled`` of ``ColumnParallelLinear`` and ``RowParallelLinear`` in ``GPTNeoXAttention`` and ``GPTNeoXMLP``;
* replace torch ``LayerNorm`` in ``GPTNeoXLayer`` and ``GPTNeoXModel`` with neuronx-distributed ``LayerNorm`` with ``sequence_parallel_enabled`` turned on;
* dimension transposition of intermediate states in the forward function of ``GPTNeoXAttention``;
* dimension transposition and collective communication of intermediate states in the forward function of ``GPTNeoXModel``.

At the training script level, we enable:

* all-reduce of sequence parallel gradients at the gradient accumulation boundary.

Please check `modeling_gpt_neox_nxd.py `__ and `tp_dp_gpt_neox_20b_hf_pretrain.py `__ for details.

**Parallel Cross Entropy**

To enable parallel cross entropy, we made the following model-level modifications:

* replace the ``CrossEntropyLoss`` with neuronx-distributed ``parallel_cross_entropy`` in the forward function of ``GPTNeoXForCausalLM``;
* use ``ColumnParallelLinear`` for the ``embed_out`` layer in ``GPTNeoXForCausalLM``.

Please check ``modeling_gpt_neox_nxd.py`` for details.

================================================
FILE: archive/tutorials/training-gpt-neox.rst
================================================

.. _gpt_neox_tp_zero1_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This documentation for the AWS Neuron SDK is currently unsupported and not maintained. It is provided for reference only.

Training GPT-NeoX 6.9B with Tensor Parallelism and ZeRO-1 Optimizer
=========================================================================================

In this section, we showcase how to pretrain a GPT-NeoX 6.9B model by using tensor parallelism and the zero-1 optimizer in the ``neuronx-distributed`` package. Please refer to the `Neuron Samples repository `__ to view the files in this tutorial.

**Setting up environment:**

For this experiment, we will use a ParallelCluster with at least four trn1-32xl compute nodes. `Train your model on ParallelCluster `__ introduces how to set up and use a ParallelCluster. We first need to create and activate a Python virtual env on the head node of the ParallelCluster.

Next follow the instructions mentioned here: :ref:`Install PyTorch Neuron on Trn1 ` to install neuron python packages.

We also need to install and clone the ``neuronx-distributed`` package using the following command:

.. code:: ipython3

   python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
   git clone git@github.com:aws-neuron/neuronx-distributed.git

Let’s download the scripts for pretraining.

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_6_9b.sh
   :language: shell
   :lines: 4-10

Next, let’s download and pre-process the dataset:

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_6_9b.sh
   :language: shell
   :lines: 12

At this point, you are all set to start training.

**Running training**

We first pre-compile the graphs using ``neuron_parallel_compile``. Let’s run the command below:

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_6_9b.sh
   :language: shell
   :lines: 16-18

This script uses a tensor-parallel size of 8. This will automatically set the zero-1 sharding degree to 16 (4 * 32 workers / tensor_parallel_size). Once the graphs are compiled, we can run training and observe the loss going down. To run the training, we run the above command, but without ``neuron_parallel_compile``.

.. literalinclude:: nxd-source-code/gpt_neox_tp_zero1/gpt_neox_6_9b.sh
   :language: shell
   :lines: 20-22

**ZeRO-1 Optimizer**

The training script uses the ZeRO-1 optimizer, where the optimizer states are partitioned across the ranks so that each rank updates only its own partition. The code snippet below shows how the ZeRO-1 optimizer is used in the training script:

.. code:: ipython3

   from neuronx_distributed.optimizer import NeuronZero1Optimizer

   optimizer = NeuronZero1Optimizer(
       optimizer_grouped_parameters,
       AdamW_FP32OptimParams,
       lr=flags.lr,
       pin_layout=False,
       sharding_groups=parallel_state.get_data_parallel_group(as_list=True),
       grad_norm_groups=parallel_state.get_tensor_model_parallel_group(as_list=True),
   )

================================================
FILE: archive/tutorials/training_codegen25_7b.rst
================================================

.. _codegen25_7b_tp_zero1_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Training CodeGen2.5 7B with Tensor Parallelism and ZeRO-1 Optimizer
==============================================================================================

In this tutorial, we showcase how to pretrain a CodeGen2.5 7B model for program synthesis. Since CodeGen2.5's architecture is identical to that of Llama2, you may want to take a look at our `Llama2 tutorial `__ first. After setting up the environment and installing ``neuronx-distributed``, we need to download a data set containing source code (in this case Java code) and then preprocess and tokenize it to match the code-infill format (more about this below). Use the following commands to download the required files. Note that we reuse our Llama2 training files.
================================================
FILE: archive/tutorials/training_codegen25_7b.rst
================================================
.. _codegen25_7b_tp_zero1_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Training CodeGen2.5 7B with Tensor Parallelism and ZeRO-1 Optimizer
==============================================================================================

In this tutorial, we showcase how to pretrain a CodeGen2.5 7B model for program synthesis. Since CodeGen2.5's architecture is identical to that of Llama2, you may want to take a look at our `Llama2 tutorial `__ first. After setting up the environment and installing ``neuronx-distributed``, we need to download a data set containing source code (in this case Java code) and then preprocess and tokenize it to match the code-infill format (more about this below). Use the following commands to download the required files. Note that we reuse our Llama2 training files.

.. code:: bash

   mkdir -p ~/examples/tp_zero1_codegen25_7b_hf_pretrain
   cd ~/examples/tp_zero1_codegen25_7b_hf_pretrain
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/modeling_llama_nxd.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/tp_zero1_llama_hf_pretrain/tp_zero1_llama_hf_pretrain.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/tp_zero1_llama_hf_pretrain/logger.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/codegen25/tp_zero1_codegen25_7b_hf_pretrain.sh
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/codegen25/get_dataset_infill.py
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/codegen25/get_dataset_infill.sh
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/codegen25/requirements.txt
   chmod +x tp_zero1_codegen25_7b_hf_pretrain.sh
   chmod +x get_dataset_infill.sh
   python3 -m pip install -r requirements.txt

Data Preprocessing and Tokenization
------------------------------------

To tokenize the data, we will use the CodeGen2.5 tokenizer from the HuggingFace repository. Download it by cloning the repository.

.. code:: bash

   cd ~/examples
   git clone https://huggingface.co/Salesforce/codegen25-7b-mono
   cd codegen25-7b-mono
   rm config.json # Need to use our config.json for some Trainium-specific settings
   wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/codegen25/config.json
   cd ..

This tutorial makes use of a clean Java subset of The Stack corpus, which we preprocess to fit the infill format. The infill format samples a random number of spans and formats the input in the following way:

.. code:: Python

   def count_words(filename: str) -> Dict[str, int]:
       """Count the number of occurrences of each word in the file."""
       with open(filename, 'r') as f:
           word_counts = {}
           for line in f:
               for word in line.split():
                   if word in word_counts:
                       word_counts[word] += 1
                   else:
                       word_counts[word] = 1
       return word_counts

becomes

.. code:: Python

   def count_words(filename: str) -> Dict[str, int]:
       """Count the number of occurrences of each word in the file."""
       with open(filename, 'r') as f:
           <mask_1> in word_counts:
                       word_counts[word] += 1
                   else:
                       word_counts[word] = 1
       return word_counts<|endoftext|><mask_1> word_counts = {}
           for line in f:
               for word in line.split():
                   if word<eom>

For each span, we introduce two ``<mask_1>`` tokens: one signals the model that a span is missing at this position, and one (at the end of the code) is followed by the original code span. Lastly, each span is suffixed with an end-of-mask (``<eom>``) token. You can preprocess and tokenize the dataset by running:

.. code:: bash

   cd ~/examples/tp_zero1_codegen25_7b_hf_pretrain
   ./get_dataset_infill.sh

This will preprocess and store the data in your home directory at ``~/example_datasets/bigcode-stack-java_tokenized_infill``.
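To make the transformation concrete, here is a hedged, single-span sketch of the masking step described above; the actual ``get_dataset_infill.py`` logic (span sampling, number of spans) may differ:

.. code-block:: python

   import random

   def to_infill_format(code: str) -> str:
       """Cut one random span out of `code` and append it after the EOS token."""
       start = random.randrange(0, max(1, len(code) // 2))
       end = random.randrange(start + 1, len(code) + 1)
       prefix, span, suffix = code[:start], code[start:end], code[end:]
       # prefix <mask_1> suffix <|endoftext|> <mask_1> span <eom>
       return f"{prefix}<mask_1>{suffix}<|endoftext|><mask_1>{span}<eom>"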
Starting Training
-----------------

At this point, you are all set to start training. By default, we use a tensor parallel degree of 8, a global batch size of 256, and train for 10k steps. Feel free to change these settings in the ``tp_zero1_codegen25_7b_hf_pretrain.sh`` script. We first pre-compile the graphs using the ``neuron_parallel_compile`` tool. Let’s run the command below:

.. code:: bash

   sbatch --exclusive \
       --nodes 1 \
       --wrap="srun neuron_parallel_compile bash $(pwd)/tp_zero1_codegen25_7b_hf_pretrain.sh"

Once the graphs are compiled, we can run training and observe our loss going down. To do so, we run the same command, omitting ``neuron_parallel_compile``.

.. code:: bash

   sbatch --exclusive \
       --nodes 1 \
       --wrap="srun bash $(pwd)/tp_zero1_codegen25_7b_hf_pretrain.sh"

Happy training!

================================================
FILE: archive/tutorials/training_llama2_tp_pp_ptl.rst
================================================
.. _llama2_tp_pp_ptl_tutorial:

.. meta::
   :noindex:
   :nofollow:
   :description: This tutorial for the AWS Neuron SDK is currently archived and not maintained. It is provided for reference only.

Training Llama-2-7B/13B/70B using Tensor Parallelism and Pipeline Parallelism with Neuron PyTorch-Lightning
============================================================================================================

In this section, we showcase how to pretrain Llama2 7B/13B/70B with tensor parallelism and pipeline parallelism using the Neuron PyTorch-Lightning APIs. Please refer to the Llama2 13B/70B Tutorial and the Neuron PT-Lightning Developer Guide for more context.

Setting up environment:
^^^^^^^^^^^^^^^^^^^^^^^

For this experiment, we will use AWS ParallelCluster with at least four trn1.32xlarge compute nodes (at least 32 nodes are needed for the 13B/70B model sizes). `Train your model on ParallelCluster `__ introduces how to set up and use a ParallelCluster. To set up the packages on the headnode of the ParallelCluster, follow the instructions mentioned here: :ref:`Install PyTorch Neuron on Trn1 `. We also need to install the ``neuronx-distributed`` package inside the virtual env using the following command:

.. code:: ipython3

   python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
   git clone git@github.com:aws-neuron/neuronx-distributed.git

Let’s download the scripts for pretraining:

1. Navigate to a directory to hold our experiments

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_tp_pp_ptl_setup.sh
   :language: shell
   :lines: 4

2. Link the training scripts for our experiments

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_tp_pp_ptl_setup.sh
   :language: shell
   :lines: 5-10

If you want to pre-train Llama 7B, you would need to run the following steps -

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_7b.sh
   :language: shell
   :lines: 5-8

If you want to pre-train Llama 13B, you would need to run the following steps -

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_13b.sh
   :language: shell
   :lines: 5-8

If you want to pre-train Llama 70B, you would need to run the following steps -

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_70b.sh
   :language: shell
   :lines: 5-8

3. Install the additional requirements and give the right permissions to our shell script

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_tp_pp_ptl_setup.sh
   :language: shell
   :lines: 12-13

Next, we tokenize our dataset. ``Note``: To tokenize the data, we must request the tokenizer from `HuggingFace` and `Meta` by following the instructions at the following link: `HuggingFace Llama 2 7B Model `__ . Use of the Llama 2 model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept their License before requesting access.
After access has been granted, you may use the download scripts provided by Meta to download the model weights and tokenizer to your cluster. Once you have downloaded the tokenizer and model weights, you can copy the ``tokenizer.model`` to the ``~/examples/llama2_lightning`` directory.

Next, let’s download and pre-process the dataset:

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_7b.sh
   :language: shell
   :lines: 13

``Note``: In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/llama2_lightning'. Use `repo_type` argument if needed.`` This could be because of a stale cache. Try deleting the cache using:

.. code:: ipython3

   sudo rm -rf /home/ubuntu/.cache/

At this point, you are all set to start training.

Training Llama2-7B with Tensor Parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By this step, the ParallelCluster is all set up for running experiments. Before we run training, we first pre-compile the graphs using the :ref:`neuron_parallel_compile `. Let’s run the command below:

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_7b.sh
   :language: shell
   :lines: 17-20

This script uses a tensor-parallel size of 8. This will automatically set the zero-1 sharding degree to 16 (4 * 32 workers / tensor_parallel_size).

``Note``: You can use any number of nodes in this case; you would just need to adjust the number of nodes in the above slurm command accordingly. Also, the number of nodes used in the parallel_compile command should be the same as in the actual training run. This is because, as the number of nodes changes, the data-parallel degree changes too. This would result in more workers participating in operations like `gradient all-reduce`, which would result in new graphs getting created.

Once the graphs are compiled, we can run training and observe our loss going down. To run the training, we just run the above command but without ``neuron_parallel_compile``.

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_7b.sh
   :language: shell
   :lines: 22-25

Training Llama2-13B/70B with Tensor Parallelism and Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we use ``Llama70B`` as an example. To run 13B, simply change the script from ``run_llama_70b_tp_pp.sh`` to ``run_llama_13B_tp_pp.sh``. Before we run training, we first pre-compile the graphs using the :ref:`neuron_parallel_compile `. Let’s run the command below:

Pre-compiling:

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_70b.sh
   :language: shell
   :lines: 17-20

This script uses a tensor-parallel size of 8 and a pipeline-parallel size of 8.
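To make the degree arithmetic above concrete, here is a small illustrative calculation (the helper name is hypothetical; it assumes one worker per NeuronCore, i.e. 32 workers per trn1.32xlarge node):

.. code-block:: python

   WORKERS_PER_NODE = 32  # trn1.32xlarge

   def data_parallel_degree(nodes: int, tp: int, pp: int = 1) -> int:
       """Data-parallel (and hence ZeRO-1 sharding) degree left over
       after tensor and pipeline parallelism are carved out."""
       world_size = nodes * WORKERS_PER_NODE
       return world_size // (tp * pp)

   print(data_parallel_degree(nodes=4, tp=8))         # 7B, TP only -> 16
   print(data_parallel_degree(nodes=32, tp=8, pp=8))  # 70B, TP + PP -> 16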
To run the training, we just use the above command but without ``neuron_parallel_compile``.

.. literalinclude:: nxd-source-code/llama_tp_pp_ptl/llama_2_70b.sh
   :language: shell
   :lines: 22-25

Checkpointing:
^^^^^^^^^^^^^^

To enable checkpoint saving, add the following flags to ``run_llama_7b_tp_ptl.sh`` / ``run_llama_13b_tp_pp.sh`` / ``run_llama_70B_tp_pp.sh``:

* ``--save_checkpoint`` Add this flag to enable checkpoint saving
* ``--checkpoint_freq`` Number of steps between checkpoint saves
* ``--checkpoint_dir`` Directory to save the checkpoint to
* ``--num_kept_checkpoint`` Number of checkpoints to keep; older checkpoints will be deleted automatically. Set to -1 to keep all saved checkpoints.
* ``--save_load_xser`` Save/load with torch-xla serialization. It's recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa.

To enable checkpoint loading, add the following flags to ``run_llama_7b_tp_ptl.sh`` / ``run_llama_13b_tp_pp.sh`` / ``run_llama_70B_tp_pp.sh``:

* ``--resume_ckpt``
* ``--load_step`` Step to retrieve the checkpoint from
* ``--checkpoint_dir`` Directory to load the checkpoint from
* ``--save_load_xser`` Save/load with torch-xla serialization. It's recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa.
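For illustration, a checkpoint-enabled run might append flags like the following to the training invocation inside the shell script (the values here are examples only, not taken from the tutorial):

.. code-block:: bash

   # Example only: save a checkpoint every 100 steps, keep the 5 most
   # recent, and use torch-xla serialization for faster save/load.
   --save_checkpoint \
   --checkpoint_freq 100 \
   --checkpoint_dir ~/llama_checkpoints \
   --num_kept_checkpoint 5 \
   --save_load_xser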
================================================
FILE: archive/tutorials/tutorial_source_code/t5_finetuning/t5_finetuning_32_worker_training_code.sh
================================================
#!/bin/bash
set -eExuo pipefail
cd ~/transformers/examples/pytorch/summarization
# Create run 32 worker script
tee run_32w.sh > /dev/null < /dev/null < /dev/null < /dev/null <> temp_run_summarization.py
mv temp_run_summarization.py run_summarization.py
chmod +x run_summarization.py
# Run run_summarization to predict without generate
NEURON_NUM_DEVICES=0 python3 ./run_summarization.py \
    --model_name_or_path \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --do_predict \
    --predict_with_generate \
    --source_prefix "summarize: " \
    --per_device_eval_batch_size 4 \
    --max_source_length 512 \
    --pad_to_max_length \
    --no_cuda \
    --output_dir /tmp/tst-summarization |& tee log_run

================================================
FILE: archive/tutorials/tutorial_source_code/t5_finetuning/t5_modify_run_summarization_code.sh
================================================
#!/bin/bash
set -eExuo pipefail
cd ~/transformers/examples/pytorch/summarization
# Insert code into run summarization to disable DDP for torchrun
tee temp_run_summarization.py > /dev/null <> temp_run_summarization.py
mv temp_run_summarization.py run_summarization.py
chmod +x run_summarization.py

================================================
FILE: audit-report.md
================================================
# Frameworks Audit Report

## Orphaned Pages

| File Path | Type | Reason | Action |
|---|---|---|---|
| frameworks/mxnet-neuron/container-sm-hosting-devflow.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/dlc-then-ec2-devflow.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/dlc-then-ecs-devflow.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/env-setup.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/refman.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/rn.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/setup/mxnet-install-prev.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/setup/mxnet-update-al2.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/setup/mxnet-update-u22.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/tutorials/bert_mxnet/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/mxnet-neuron/tutorials/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/inference.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/container-sm-hosting-devflow.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/dlc-then-k8s-devflow.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/env-setup.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/refman.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/rn.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/setup/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-al2.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-update-al2.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/tf1_faq.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/tutorials/yolo_v4_demo/code.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/tutorials/yolo_v4_demo/yolo_v4_demo.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-update.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuronx/tensorflow-neuron-quickstart.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuronx/tensorflow-neuron-supported-operators.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuronx/tutorials/inference/tensorflow-neuronx-serving-tutorial.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/inference.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuron/env-setup.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuron/setup/pytorch-update-al2.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuron/tutorials/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/note-setup-general.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.3.0-pytorch-install.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.4.0-pytorch-install.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.5.0-pytorch-install.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.6.0-pytorch-install.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/pytorch-install-prev.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/pytorch-update.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/setup-inference.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/setup/setup-training.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/tutorials/inference/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/torch-neuronx/tutorials/training/index.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/torch/training.rst | .rst | Not in any toctree or cross-reference | Delete |
| frameworks/tensorflow/tensorflow-neuron/tutorials/bert_demo/uncased_L-24_H-1024_A-16.vocab.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/dropdown-neuron-setup.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/tab-inference-torch-neuronx.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/tab-training-torch-neuronx.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/torch-neuronx/api-reference-guide/inference/inference-api-guide-torch-neuronx.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/torch-neuronx/api-reference-guide/training/index.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/torch-neuronx/programming-guide/inference/index.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/torch-neuronx/programming-guide/training/index.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
| frameworks/torch/torch-neuronx/setup/install-templates/pytorch-dev-install.txt | .txt (include fragment) | Not referenced by any .. include:: directive | Delete |
## Stale Pages

| File Path | Staleness Indicators | Recommendation |
|---|---|---|
| frameworks/mxnet-neuron/misc-mxnet-neuron.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/mxnet-neuron/misc-mxnet-neuron.txt | References deprecated neuron-cc compiler | Will be archived |
| frameworks/mxnet-neuron/rn.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/mxnet-neuron/setup/mxnet-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/mxnet-neuron/setup/mxnet-update.rst | Amazon Linux 2 | Will be archived |
| frameworks/mxnet-neuron/tutorials/bert_mxnet/index.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/api-compilation-python-api.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/refman.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/rn.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-update.rst | Amazon Linux 2 | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/tf1_faq.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/tf2_faq.rst | Ubuntu 18.04 | Will be archived |
| frameworks/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.8.0-tensorflow-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.9.0-tensorflow-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-neuronx-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-update.rst | Amazon Linux 2 | Will be archived |
| frameworks/torch/dropdown-neuron-setup.txt | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/guide-torch-neuron-vs-torch-neuronx-inference.rst | References deprecated neuron-cc compiler | Update or archive |
| frameworks/torch/inference-torch-neuron.txt | References deprecated neuron-cc compiler | Update or archive |
| frameworks/torch/torch-neuron/api-compilation-python-api.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/torch/torch-neuron/misc-inference-torch-neuron.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/torch/torch-neuron/misc-inference-torch-neuron.txt | References deprecated neuron-cc compiler | Will be archived |
| frameworks/torch/torch-neuron/setup/pytorch-install.rst | Amazon Linux 2 | Will be archived |
| frameworks/torch/torch-neuron/setup/pytorch-update.rst | Amazon Linux 2 | Will be archived |
| frameworks/torch/torch-neuron/troubleshooting-guide.rst | References deprecated neuron-cc compiler | Will be archived |
| frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.rst | References deprecated neuron-cc compiler | Update or archive |
| frameworks/torch/torch-neuronx/programming-guide/inference/core-placement.rst | References deprecated neuron-cc compiler | Update or archive |
| frameworks/torch/torch-neuronx/setup-trn1-multi-node-execution.rst | Ubuntu 20.04 | Update or archive |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.4.0-pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.6.0-pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.7.0-pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.8.0-pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.9.0-pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/pytorch-install.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/setup/pytorch-update.rst | Amazon Linux 2; torch-neuron setup/update with unsupported OS: Amazon Linux 2 | Update or archive |
| frameworks/torch/torch-neuronx/training-troubleshooting.rst | Ubuntu 18.04; torch-neuron setup/update with unsupported OS: Ubuntu 18.04 | Update or archive |

================================================
FILE: build.sh
================================================
#!/bin/bash
# build.sh - Docker + uv workflow for Neuron docs
set -e

IMAGE_NAME="neuron-docs"

case "${1:-build}" in
  build)
    docker build -t "$IMAGE_NAME" .
    ;;
  html)
    docker run --rm -v "$(pwd):/docs" "$IMAGE_NAME" -c "sphinx-build -b html . _build/html -j auto"
    ;;
  shell)
    docker run --rm -it -v "$(pwd):/docs" "$IMAGE_NAME"
    ;;
  clean)
    rm -rf _build
    ;;
  *)
    echo "Usage: $0 {build|html|shell|clean}"
    exit 1
    ;;
esac

================================================
FILE: compiler/error-codes/EARG001.rst
================================================
.. _error-code-earg001:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EARG001.

NCC_EARG001
===========

**Error message**: This error occurs when you attempt to use a Logical Neuron Core (LNC) configuration that is not supported by the target Neuron architecture.

For example, a trn1 instance running the following code will run into this error:

.. code-block:: python

   traced_model = torch_neuronx.trace(
       model,
       input,
       compiler_args=['--lnc', '2']  # ERROR: lnc=2 not supported on trn1
   )

On trn1, only lnc=1 is supported.

Physical Neuron Core:

- Actual hardware compute unit on the chip
- Has dedicated compute resources, memory, etc.

Logical Neuron Core:

- Software abstraction grouping multiple physical cores
- Controlled via the NEURON_LOGICAL_NC_CONFIG environment variable or the --lnc flag (when using neuronx-cc directly)

For more information: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/device-memory.html#logical-neuron-cores
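As a hedged illustration of the fix, compile with an LNC value supported by the target architecture (the ``model``/``input`` names mirror the erroneous example above):

.. code-block:: python

   import torch_neuronx

   # On trn1 only lnc=1 is supported; newer architectures such as trn2
   # also accept lnc=2.
   traced_model = torch_neuronx.trace(
       model,
       input,
       compiler_args=['--lnc', '1'],  # supported on trn1
   )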
================================================
FILE: compiler/error-codes/EBIR023.rst
================================================
.. _error-code-ebir023:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EBIR023.

NCC_EBIR023
===========

**Error message**: MLP kernel intermediate size exceeds the maximum supported value of 4096.

Consider tiling large intermediate tensors in your kernel to stay within the supported limit, or increase tensor parallelism to shard the intermediate dimension across more cores.

================================================
FILE: compiler/error-codes/EBVF030.rst
================================================
.. _error-code-ebvf030:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EBVF030.

NCC_EBVF030
===========

**Error message**: The number of instructions generated exceeds the limit.

Consider applying model parallelism as partitioning the model will help break large computational graphs into smaller subgraphs.

For more information:

- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#api-guide
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html

================================================
FILE: compiler/error-codes/EHCA005.rst
================================================
.. _error-code-ehca005:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EHCA005.

NCC_EHCA005
===========

**Error message**: The compiler encountered a custom call instruction with a target name that is not recognized.

The Neuron compiler currently recognizes the following custom call targets:

- AwsNeuronErf
- AwsNeuronGelu
- AwsNeuronGeluApprxTanh
- AwsNeuronGeluBackward
- AwsNeuronSilu
- AwsNeuronSiluBackward
- AwsNeuronRmsNorm
- AwsNeuronSoftmax
- AwsNeuronSoftmaxBackward
- AwsNeuronCollectiveMatmul
- AwsNeuronIntMatmult
- AwsNeuronArgMax
- AwsNeuronArgMin
- AwsNeuronTopK
- AwsNeuronDropoutMaskV1
- AwsNeuronCustomNativeKernel
- AwsNeuronCustomOp
- AwsNeuronDevicePrint
- ResizeNearest
- ResizeBilinear
- ResizeNearestGrad
- AwsNeuronLNCShardingConstraint
- AwsNeuronTransferWithStaticRing
- AwsNeuronModuleMarkerStart-Forward
- AwsNeuronModuleMarkerStart-Backward
- AwsNeuronModuleMarkerEnd-Forward
- AwsNeuronModuleMarkerEnd-Backward
- NeuronBoundaryMarker-Start
- NeuronBoundaryMarker-End

Erroneous code example:

.. code-block:: python

   def lowering(ctx, x_val):
       result_type = ir.RankedTensorType(x_val.type)
       # This target name will not be recognized by HandleCustomCall
       return hlo.CustomCallOp(
           [result_type],
           [x_val],
           call_target_name="UNRECOGNIZED_TARGET",
           has_side_effect=ir.BoolAttr.get(False),
       ).results

Use a supported custom call target:

.. code-block:: python

   def lowering(ctx, x_val):
       result_type = ir.RankedTensorType(x_val.type)
       return hlo.CustomCallOp(
           [result_type],
           [x_val],
           call_target_name="AwsNeuronSilu",
           has_side_effect=ir.BoolAttr.get(False),
           backend_config=ir.StringAttr.get(""),
           api_version=ir.IntegerAttr.get(ir.IntegerType.get_signless(32), 2),
       ).results

================================================
FILE: compiler/error-codes/EOOM001.rst
================================================
.. _error-code-eoom001:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EOOM001.

NCC_EOOM001
===========

**Error message**: The combined memory needed for the model tensors exceeds the high-bandwidth memory limit.
The memory usage consists of: - I/O tensors: Input and output activation tensors - Internal allocations: Scratchpad memory for intermediate computations - SBUF spills: Data that cannot fit in on-chip SBUF memory and must spill to HBM There are several ways to potentially fix this issue. 1. Simply reduce the batch/tensor size if possible 2. Utilize pipeline/tensor parallelism via neuronx-distributed Short snippet of tensor parallelism: .. code-block:: python class ParallelSelfAttention(transformers.models.bert.modeling_bert.BertSelfAttention): def __init__(self, config, position_embedding_type=None): super().__init__(config, position_embedding_type) self.query = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.key = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.value = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) # Since we shard the number of attention heads across tensor parallel # ranks, each rank would have a subset of heads, hence, we update # the num_attention_heads here. tp_size = parallel_state.get_tensor_parallel_size() self.num_attention_heads = self.num_attention_heads // tp_size self.all_head_size = self.all_head_size // tp_size For more information: - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction.html - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html ================================================ FILE: compiler/error-codes/EOOM002.rst ================================================ .. _error-code-eoom002: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EOOM002. NCC_EOOM002 =========== **Error message**: The combined memory needed for the model tensors exceeds the high-bandwidth memory limit. The memory usage consists of: - I/O tensors: Input and output activation tensors - Internal allocations: Scratchpad memory for intermediate computations - SBUF spills: Data that cannot fit in on-chip SBUF memory and must spill to HBM There are several ways to potentially fix this issue. 1. Simply reduce the batch/tensor size if possible 2. Utilize pipeline/tensor parallelism via neuronx-distributed Short snippet of tensor parallelism: .. code-block:: python class ParallelSelfAttention(transformers.models.bert.modeling_bert.BertSelfAttention): def __init__(self, config, position_embedding_type=None): super().__init__(config, position_embedding_type) self.query = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.key = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.value = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) # Since we shard the number of attention heads across tensor parallel # ranks, each rank would have a subset of heads, hence, we update # the num_attention_heads here. 
        tp_size = parallel_state.get_tensor_parallel_size()
        self.num_attention_heads = self.num_attention_heads // tp_size
        self.all_head_size = self.all_head_size // tp_size

For more information:

- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction.html
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html

================================================
FILE: compiler/error-codes/ESFH002.rst
================================================
.. _error-code-esfh002:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error ESFH002.

NCC_ESFH002
===========

**Error message**: The compiler encountered an unsigned 64-bit integer constant with a value that cannot be safely converted to 32-bit representation.

The Neuron hardware operates on 32-bit or narrower data types and attempts to convert 64-bit integers to 32-bit. 64-bit constants that exceed the 32-bit range and cannot be safely converted will fail compilation. Try to use uint32 for constants when possible and restructure code to avoid large constants.

Erroneous code example:

.. code-block:: python

   @jax.jit
   def foo():
       # direct uint64 constant in arithmetic operation
       x = jnp.array([1, 2, 3], dtype=jnp.uint64)
       # large constant that exceeds uint32 max
       large_constant = jnp.uint64(5_000_000_000)
       return x + large_constant

Use uint32 for constants when possible:

.. code-block:: python

   @jax.jit
   def foo():
       x = jnp.array([1, 2, 3], dtype=jnp.uint32)
       # constant restructured to fit within the uint32 range (max 4_294_967_295)
       large_constant = jnp.uint32(4_000_000_000)
       return x + large_constant

================================================
FILE: compiler/error-codes/ESPP004.rst
================================================
.. _error-code-espp004:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error ESPP004.

NCC_ESPP004
===========

**Error message**: The compiler encountered a data type that is not supported for code generation.

Erroneous code example:

.. code-block:: python

   import numpy as np
   import jax.numpy as jnp
   import jax
   from jax._src import dtypes
   from jax._src.lax import lax as lax_internal

   # float4_e2m1fn type not supported
   dtype = np.dtype(dtypes.float4_e2m1fn)
   val = lax_internal._convert_element_type(0, dtype, weak_type=False)

Use a supported data type:

.. code-block:: python

   import numpy as np
   import jax.numpy as jnp
   import jax
   from jax._src import dtypes
   from jax._src.lax import lax as lax_internal

   # use a supported type such as bfloat16 instead of float4_e2m1fn
   dtype = jnp.bfloat16
   val = lax_internal._convert_element_type(0, dtype, weak_type=False)

More information on supported data types: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/data-types.html

================================================
FILE: compiler/error-codes/ESPP047.rst
================================================
.. _error-code-espp047:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error ESPP047.

NCC_ESPP047
===========

**Error message**: The compiler found usage of an unsupported 8-bit floating-point data type.

Erroneous code example:
.. code-block:: python

   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.linear1 = nn.Linear(10, 20)
           self.linear2 = nn.Linear(20, 10)

       def forward(self, x):
           x = self.linear1(x)
           x = torch.relu(x)
           x = self.linear2(x)
           return x

   # Unsupported 8-bit floating-point data type being used here
   input_tensor = torch.randn(1, 10).to(torch.float8_e4m3fn)

To fix this error:

.. code-block:: python

   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.linear1 = nn.Linear(10, 20)
           self.linear2 = nn.Linear(20, 10)

       def forward(self, x):
           x = self.linear1(x)
           x = torch.relu(x)
           x = self.linear2(x)
           return x

   input_tensor = torch.randn(1, 10).to(torch.float8_e4m3fn)
   # Convert to a supported type
   input_tensor = input_tensor.to(torch.float16)

================================================
FILE: compiler/error-codes/EUOC002.rst
================================================
.. _error-code-euoc002:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EUOC002.

NCC_EUOC002
===========

**Error message**: An unsupported operator was used.

Try using alternative operators from the full list of supported operators via `neuronx-cc list-operators --framework XLA` to work around the limitation.

Before:

.. code-block:: python

   class Model(torch.nn.Module):
       def forward(self, A, b):
           return torch.triangular_solve(b, A)

Possible workaround:

.. code-block:: python

   class Model(torch.nn.Module):
       def forward(self, A, b):
           # Although slower than triangular_solve, this is mathematically equivalent
           A_inv = torch.inverse(A)
           return A_inv @ b

================================================
FILE: compiler/error-codes/EVRF001.rst
================================================
.. _error-code-evrf001:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF001.

NCC_EVRF001
===========

**Error message**: An unsupported operator was used.

Try using alternative operators from the full list of supported operators via `neuronx-cc list-operators --framework XLA` to work around the limitation.

Before:

.. code-block:: python

   class Model(torch.nn.Module):
       def forward(self, A, b):
           return torch.triangular_solve(b, A)

Possible workaround:

.. code-block:: python

   class Model(torch.nn.Module):
       def forward(self, A, b):
           # Although slower than triangular_solve, this is mathematically equivalent
           A_inv = torch.inverse(A)
           return A_inv @ b

================================================
FILE: compiler/error-codes/EVRF004.rst
================================================
.. _error-code-evrf004:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF004.

NCC_EVRF004
===========

**Error message**: Complex data types are not supported on the Neuron device.

You cannot use complex data types (such as ``complex64``, ``complex128``, and others) on the Neuron device directly. One fix is to offload complex operations to CPU, like so:

.. code-block:: python

   x = torch.tensor([1+2j, 3+4j], dtype=torch.complex64).to('cpu')

.. note:: Since data transfer between CPU and device is expensive, this is best used when complex operations are rare.

You can also address this error by manually emulating complex tensors using real and imaginary parts:

.. code-block:: python

   # Split two complex tensors a and b into real/imaginary components
   a_real, a_imag = a.real, a.imag
   b_real, b_imag = b.real, b.imag
   # (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
   real_out = a_real * b_real - a_imag * b_imag
   imag_out = a_real * b_imag + a_imag * b_real
================================================
FILE: compiler/error-codes/EVRF005.rst
================================================
.. _error-code-evrf005:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF005.

NCC_EVRF005
===========

**Error message**: The compiler found usage of F8E4M3FNUZ, F8E4M3B11FNUZ, or F8E5M2FNUZ data type which is not supported.

Erroneous code example:

.. code-block:: python

   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.linear1 = nn.Linear(10, 20)
           self.linear2 = nn.Linear(20, 10)

       def forward(self, x):
           x = self.linear1(x)
           x = torch.relu(x)
           x = self.linear2(x)
           return x

   input_tensor = torch.randn(1, 10).to(torch.float8_e4m3fnuz)

To fix this error:

.. code-block:: python

   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.linear1 = nn.Linear(10, 20)
           self.linear2 = nn.Linear(20, 10)

       def forward(self, x):
           x = self.linear1(x)
           x = torch.relu(x)
           x = self.linear2(x)
           return x

   input_tensor = torch.randn(1, 10).to(torch.float8_e4m3fnuz)
   # Convert to a supported type
   input_tensor = input_tensor.to(torch.float16)

More information on supported data types: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/data-types.html

================================================
FILE: compiler/error-codes/EVRF006.rst
================================================
.. _error-code-evrf006:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF006.

NCC_EVRF006
===========

The compiler encountered an RNGBitGenerator operation using a random number generation algorithm other than RNG_DEFAULT.
------------------------------------------------------------------------------------------------------------------------

Ensure that you are using standard JAX/PyTorch random APIs and not explicitly specifying an RNG algorithm.

================================================
FILE: compiler/error-codes/EVRF007.rst
================================================
.. _error-code-evrf007:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF007.

NCC_EVRF007
===========

**Error message**: The number of instructions generated exceeds the limit.

Consider applying model parallelism as partitioning the model will help break large computational graphs into smaller subgraphs.

For more information:

- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#api-guide
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html
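As an illustrative (not prescriptive) way to reduce the instruction count of a single graph, a model can also be traced in smaller pieces; ``first_half``/``second_half`` below are hypothetical submodules of the original model:

.. code-block:: python

   import torch_neuronx

   # Trace two smaller subgraphs instead of one very large graph, then
   # chain them at inference time.
   first_traced = torch_neuronx.trace(first_half, example_input)
   intermediate = first_half(example_input)
   second_traced = torch_neuronx.trace(second_half, intermediate)

   def forward(x):
       return second_traced(first_traced(x))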
================================================
FILE: compiler/error-codes/EVRF009.rst
================================================
.. _error-code-evrf009:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF009.

NCC_EVRF009
===========

**Error message**: The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit.

There are several ways to potentially fix this issue.

1. Simply reduce the batch/tensor size if possible
2. Utilize pipeline/tensor parallelism via neuronx-distributed

Short snippet of tensor parallelism:

.. code-block:: python

   class ParallelSelfAttention(transformers.models.bert.modeling_bert.BertSelfAttention):
       def __init__(self, config, position_embedding_type=None):
           super().__init__(config, position_embedding_type)
           self.query = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False)
           self.key = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False)
           self.value = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False)

           # Since we shard the number of attention heads across tensor parallel
           # ranks, each rank would have a subset of heads, hence, we update
           # the num_attention_heads here.
           tp_size = parallel_state.get_tensor_parallel_size()
           self.num_attention_heads = self.num_attention_heads // tp_size
           self.all_head_size = self.all_head_size // tp_size

For more information:

- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction.html
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html

================================================
FILE: compiler/error-codes/EVRF010.rst
================================================
.. _error-code-evrf010:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF010.

NCC_EVRF010
===========

**Error message**: The compiler encountered simultaneous use of input and kernel dilation, which is not supported.

Erroneous code example:

.. code-block:: python

   x = jnp.ones((1, 4, 4, 1), dtype=jnp.float32)
   kernel = jnp.ones((3, 3, 1, 1), dtype=jnp.float32)

   result = lax.conv_general_dilated(
       x, kernel,
       window_strides=(1, 1),
       padding=((2, 2), (2, 2)),
       lhs_dilation=(2, 2),  # input dilation
       rhs_dilation=(2, 2),  # kernel dilation
       dimension_numbers=('NHWC', 'HWIO', 'NHWC')
   )

If possible, use only input or kernel dilation:

.. code-block:: python

   x = jnp.ones((1, 4, 4, 1), dtype=jnp.float32)
   kernel = jnp.ones((3, 3, 1, 1), dtype=jnp.float32)

   result = lax.conv_general_dilated(
       x, kernel,
       window_strides=(1, 1),
       padding=((2, 2), (2, 2)),
       lhs_dilation=(1, 1),  # no input dilation
       rhs_dilation=(2, 2),
       dimension_numbers=('NHWC', 'HWIO', 'NHWC')
   )

Or apply dilation manually and apply convolution to the remainder.

================================================
FILE: compiler/error-codes/EVRF011.rst
================================================
.. _error-code-evrf011:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF011.

NCC_EVRF011
===========

**Error message**: The compiler encountered strided convolution combined with dilated input, which is not supported.

Erroneous code example:

.. code-block:: python

   x = jnp.ones((1, 4, 4, 1), dtype=jnp.float32)
   kernel = jnp.ones((3, 3, 1, 1), dtype=jnp.float32)

   result = lax.conv_general_dilated(
       x, kernel,
       window_strides=(2, 2),  # strided convolution
       padding=((2, 2), (2, 2)),
       lhs_dilation=(2, 2),    # and dilated input
       rhs_dilation=(1, 1),
       dimension_numbers=('NHWC', 'HWIO', 'NHWC')
   )

If possible, remove stride or input dilation:

.. code-block:: python

   x = jnp.ones((1, 4, 4, 1), dtype=jnp.float32)
   kernel = jnp.ones((3, 3, 1, 1), dtype=jnp.float32)

   result = lax.conv_general_dilated(
       x, kernel,
       window_strides=(2, 2),
       padding=((2, 2), (2, 2)),
       lhs_dilation=(1, 1),  # remove input dilation
       rhs_dilation=(1, 1),
       dimension_numbers=('NHWC', 'HWIO', 'NHWC')
   )

Or apply upsampling and downsampling separately.
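A hedged sketch of the "separately" approach, reusing the tensors from the example above: run the input-dilated convolution with stride 1, then subsample the output, which is mathematically equivalent to the strided dilated convolution:

.. code-block:: python

   # Step 1: convolution with input dilation only (stride 1)
   full = lax.conv_general_dilated(
       x, kernel,
       window_strides=(1, 1),
       padding=((2, 2), (2, 2)),
       lhs_dilation=(2, 2),
       rhs_dilation=(1, 1),
       dimension_numbers=('NHWC', 'HWIO', 'NHWC')
   )

   # Step 2: downsample separately; keeping every 2nd output position
   # reproduces the effect of window_strides=(2, 2)
   result = full[:, ::2, ::2, :]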
================================================
FILE: compiler/error-codes/EVRF013.rst
================================================
.. _error-code-evrf013:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF013.

NCC_EVRF013
===========

**Error message**: TopK does not support int32 or int64 input tensors.

Erroneous code example:

.. code-block:: python

   def forward(self, x):
       # assume x is an integer tensor
       # error: cannot call TopK on integer dtypes
       k = 5
       values, indices = torch.topk(x, k=k, dim=-1)
       return values, indices

To fix this error, you can cast your tensor to a supported floating point dtype.

.. code-block:: python

   def forward(self, x):
       x = x.float()
       k = 5
       values, indices = torch.topk(x, k=k, dim=-1)
       return values, indices

================================================
FILE: compiler/error-codes/EVRF015.rst
================================================
.. _error-code-evrf015:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF015.

NCC_EVRF015
===========

**Error message**: The compiler encountered a custom call instruction with a target name that is not recognized.

The Neuron compiler currently recognizes the following custom call targets:

- AwsNeuronErf
- AwsNeuronGelu
- AwsNeuronGeluApprxTanh
- AwsNeuronGeluBackward
- AwsNeuronSilu
- AwsNeuronSiluBackward
- AwsNeuronRmsNorm
- AwsNeuronSoftmax
- AwsNeuronSoftmaxBackward
- AwsNeuronCollectiveMatmul
- AwsNeuronIntMatmult
- AwsNeuronArgMax
- AwsNeuronArgMin
- AwsNeuronTopK
- AwsNeuronDropoutMaskV1
- AwsNeuronCustomNativeKernel
- AwsNeuronCustomOp
- AwsNeuronDevicePrint
- ResizeNearest
- ResizeBilinear
- ResizeNearestGrad
- AwsNeuronLNCShardingConstraint
- AwsNeuronTransferWithStaticRing
- AwsNeuronModuleMarkerStart-Forward
- AwsNeuronModuleMarkerStart-Backward
- AwsNeuronModuleMarkerEnd-Forward
- AwsNeuronModuleMarkerEnd-Backward
- NeuronBoundaryMarker-Start
- NeuronBoundaryMarker-End

Erroneous code example:

.. code-block:: python

   def lowering(ctx, x_val):
       result_type = ir.RankedTensorType(x_val.type)
       # This target name will not be recognized by HandleCustomCall
       return hlo.CustomCallOp(
           [result_type],
           [x_val],
           call_target_name="UNRECOGNIZED_TARGET",
           has_side_effect=ir.BoolAttr.get(False),
       ).results

Use a supported custom call target:

.. code-block:: python

   def lowering(ctx, x_val):
       result_type = ir.RankedTensorType(x_val.type)
       return hlo.CustomCallOp(
           [result_type],
           [x_val],
           call_target_name="AwsNeuronSilu",
           has_side_effect=ir.BoolAttr.get(False),
           backend_config=ir.StringAttr.get(""),
           api_version=ir.IntegerAttr.get(ir.IntegerType.get_signless(32), 2),
       ).results

================================================
FILE: compiler/error-codes/EVRF016.rst
================================================
.. _error-code-evrf016:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF016.

NCC_EVRF016
===========

The NCC_EVRF016 error is raised when the Neuron compiler detects that you are trying to use an integer or boolean type with one of the restricted reduction functions.

**Error message**: The scatter-reduce operation cannot perform reduction logic if the data being scattered or the destination tensor is using an integer or boolean data type.

The hardware instructions used on the Neuron device for these specific scatter-and-reduce functions are optimized for and limited to floating-point arithmetic.
When the compiler detects that you are trying to use an integer or boolean type with one of the restricted reduction functions, it stops the compilation process to prevent a hardware crash or incorrect calculation. **Example of the error** The following example shows the **NCC\_EVRF016** error because the :code:`input_tensor` is defined using an integer data type (:code:`torch.int32`) while being used with a reduction function (:code:`reduce='sum'`) in the :code:`scatter_reduce_` operation. .. code-block:: python def forward(self, input_tensor, indices_tensor, src_tensor): output = input_tensor.clone() output.scatter_reduce_( dim=1, index=indices_tensor, src=src_tensor, reduce='sum', ) return output # ERROR: using integer dtype with scatter-reduce input_tensor = torch.zeros(BATCH_SIZE, DIM_SIZE, dtype=torch.int32) ... **How to fix** To fix this error, you must cast your input and source tensors to a floating-point data type (e.g., torch.float32 or torch.bfloat16). .. code-block:: python def forward(self, input_tensor, indices_tensor, src_tensor): output = input_tensor.clone() output.scatter_reduce_( dim=1, index=indices_tensor, src=src_tensor, reduce='sum', ) return output # FIXED: changed to float32 # now works with scatter-reduce input_tensor = torch.zeros(BATCH_SIZE, DIM_SIZE, dtype=torch.float32) ... ================================================ FILE: compiler/error-codes/EVRF017.rst ================================================ .. _error-code-evrf017: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF017. NCC_EVRF017 =========== **Error message**: The compiler encountered a reduce-window operation with base dilation (input dilation) greater than 1, which is not supported. Erroneous code example: .. code-block:: python result = lax.reduce_window( x, -jnp.inf, lax.max, window_dimensions=(1, 1, 1, 1), window_strides=(1, 1, 1, 1), padding='VALID', base_dilation=(1, 2, 1, 1) # ERROR: applying base dilation of 2 in dimension 1 ) If possible, change base dilation to be all 1s: .. code-block:: python result = lax.reduce_window( x, -jnp.inf, lax.max, window_dimensions=(1, 1, 1, 1), window_strides=(1, 1, 1, 1), padding='VALID', base_dilation=(1, 1, 1, 1) # FIXED: all values are 1 (no dilation) ) Or consider manual dilation if necessary. ================================================ FILE: compiler/error-codes/EVRF018.rst ================================================ .. _error-code-evrf018: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF018. NCC_EVRF018 =========== **Error message**: The compiler encountered a reduce-window operation with window dilation greater than 1, which is not supported. Erroneous code example: .. code-block:: python result = lax.reduce_window( jnp.ones((1, 4, 4, 1)), -jnp.inf, lax.max, window_dimensions=(1, 2, 2, 1), window_strides=(1, 1, 1, 1), padding='VALID', window_dilation=(1, 2, 2, 1) # 2 is greater than 1 ) If possible, remove window_dilation or change values to be all 1s: .. code-block:: python result = lax.reduce_window( jnp.ones((1, 4, 4, 1)), -jnp.inf, lax.max, window_dimensions=(1, 2, 2, 1), window_strides=(1, 1, 1, 1), padding='VALID', window_dilation=(1, 1, 1, 1) ) Or consider manual dilation if necessary. ================================================ FILE: compiler/error-codes/EVRF019.rst ================================================ .. _error-code-evrf019: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF019. 
NCC_EVRF019
===========

**Error message**: The compiler encountered a reduce-window operation with more or fewer than 2 operands.

Support for reduce_window is available for exactly one input tensor and one initial value for reduction.

Erroneous code example:

.. code-block:: python

   # reduce-window operation with more or fewer than 2 operands is not supported
   # 4 operands are being provided instead of 2
   lax.reduce_window(
       (x, x),               # ERROR: a tuple of two input tensors
       (-jnp.inf, jnp.inf),  # ERROR: a tuple of two initial values
       lambda a, b: (jnp.maximum(a[0], b[0]), jnp.minimum(a[1], b[1])),
       window_dimensions=(1, 2, 2, 1),
       window_strides=(1, 2, 2, 1),
       padding='VALID'
   )

If possible, replace a multi-operand reduce_window with multiple single-operand reduce_window operations.

.. code-block:: python

   # For max pooling
   # 2 operands are correctly being provided
   max_pool = lax.reduce_window(
       x,         # FIXED: a single input tensor
       -jnp.inf,  # FIXED: a single initial value
       lax.max,
       window_dimensions=(1, 2, 2, 1),
       window_strides=(1, 2, 2, 1),
       padding='VALID'
   )

   # For min pooling
   # 2 operands are correctly being provided
   min_pool = lax.reduce_window(
       x,        # FIXED: a single input tensor
       jnp.inf,  # FIXED: a single initial value
       lax.min,
       window_dimensions=(1, 2, 2, 1),
       window_strides=(1, 2, 2, 1),
       padding='VALID'
   )

================================================
FILE: compiler/error-codes/EVRF022.rst
================================================
.. _error-code-evrf022:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF022.

NCC_EVRF022
===========

**Error message**: Shift-right-arithmetic operation on non 32-bit inputs is not supported. Cast the first argument's data type to be S32, U32, or F32.

Erroneous code example:

.. code-block:: python

   def forward(self, input, other):
       return torch.bitwise_right_shift(input, other)

   # This will be the first argument and must be 32-bit
   input = torch.tensor([16, 32, 64], dtype=torch.int16)
   # The second argument can be non 32-bit
   other = torch.tensor([1, 2, 3], dtype=torch.int16)

To fix this error:

.. code-block:: python

   def forward(self, input, other):
       return torch.bitwise_right_shift(input, other)

   # Correctly setting the first argument to be 32-bit
   input = torch.tensor([16, 32, 64], dtype=torch.int32)
   other = torch.tensor([1, 2, 3], dtype=torch.int16)

================================================
FILE: compiler/error-codes/EVRF031.rst
================================================
.. _error-code-evrf031:

.. meta::
   :description: AWS Neuron SDK Graph Compiler error code documentation for error EVRF031.

NCC_EVRF031
===========

**Error message**: The compiler encountered a scatter out-of-bounds error. The indices created via the iota instruction contain values that are beyond the size of the operand dimension.

Erroneous code example:

.. code-block:: python

   # size 3 in dimension 0
   operand = jnp.zeros((3, 4), dtype=jnp.float32)
   # iota generates indices [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
   indices = lax.iota(jnp.int32, 10)  # ERROR: size 10 > operand dimension 3
   indices = indices.reshape(10, 1)
   updates = jnp.ones((10, 4), dtype=jnp.float32)  # ERROR: 10 updates but operand only has 3 rows

   result = lax.scatter(
       operand,
       indices,  # ERROR: index values in [0, 10) but operand dimension only allows indices in [0, 3)
       updates,
       lax.ScatterDimensionNumbers(
           update_window_dims=(1,),
           inserted_window_dims=(0,),
           scatter_dims_to_operand_dims=(0,)
       )
   )

Ensure that the iota size matches the operand dimension size:

..
code-block:: python N = 3 D = 4 operand = jnp.zeros((N, D), dtype=jnp.float32) # FIXED: match iota size to operand dimension indices = lax.iota(jnp.int32, N) # size N is same as operand dimension indices = indices.reshape(N, 1) # FIXED: updates size matches operand dimension updates = jnp.ones((N, D), dtype=jnp.float32) result = lax.scatter( operand, indices, # FIXED: indices now in valid range [0, 3) updates, lax.ScatterDimensionNumbers( update_window_dims=(1,), inserted_window_dims=(0,), scatter_dims_to_operand_dims=(0,) ) ) ================================================ FILE: compiler/error-codes/EXSP001.rst ================================================ .. _error-code-exsp001: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EXSP001. NCC_EXSP001 =========== The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit. ------------------------------------------------------------------------------------------------------ There are several ways to potentially fix this issue. 1. Simply reduce the batch/tensor size if possible 2. Utilize pipeline/tensor parallelism via neuronx-distributed Short snippet of tensor parallelism: .. code-block:: python class ParallelSelfAttention(transformers.models.bert.modeling_bert.BertSelfAttention): def __init__(self, config, position_embedding_type=None): super().__init__(config, position_embedding_type) self.query = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.key = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) self.value = ColumnParallelLinear(config.hidden_size, self.all_head_size, gather_output=False) # Since we shard the number of attention heads across tensor parallel # ranks, each rank would have a subset of heads, hence, we update # the num_attention_heads here. tp_size = parallel_state.get_tensor_parallel_size() self.num_attention_heads = self.num_attention_heads // tp_size self.all_head_size = self.all_head_size // tp_size For more information: - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction.html - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html ================================================ FILE: compiler/error-codes/EXTP004.rst ================================================ .. _error-code-extp004: .. meta:: :description: AWS Neuron SDK Graph Compiler error code documentation for error EXTP004. NCC_EXTP004 =========== **Error message**: The number of instructions generated exceeds the limit. Consider applying model parallelism as partitioning the model will help break large computational graphs into smaller subgraphs. For more information: - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#api-guide - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html ================================================ FILE: compiler/error-codes/index.rst ================================================ .. meta:: :description: "Neuron Compiler error code documentation home." :date-modified: 12/02/2025 .. _ncc-errors-home: Neuron Compiler Error Codes ============================ This page lists the error codes you can encounter while developing with the Neuron Compiler. 
For more details on any individual error, click the link for that error code in the table below.

.. list-table::
   :header-rows: 1

   * - Error Code
     - Error Message
     - Recommendation
   * - :ref:`NCC_EARG001 `
     - Unsupported Logical Neuron Core (LNC) configuration.
     - You attempted to use a Logical Neuron Core configuration that is not supported by the target Neuron architecture.
   * - :ref:`NCC_EBIR023 `
     - MLP kernel intermediate size exceeds the maximum supported value of 4096.
     - Consider tiling large intermediate tensors in your kernel to stay within the supported limit, or increase tensor parallelism to shard the intermediate dimension across more cores.
   * - :ref:`NCC_EBVF030 `
     - The number of instructions generated exceeds the limit.
     - Consider applying model parallelism, as partitioning the model will help break large computational graphs into smaller subgraphs.
   * - :ref:`NCC_EHCA005 `
     - The compiler encountered a custom call instruction with a target name that is not recognized.
     - Use a supported custom call target from the list of recognized targets.
   * - :ref:`NCC_EOOM001 `
     - The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit.
     - You may need to reduce batch/tensor size or utilize pipeline/tensor parallelism via neuronx-distributed.
   * - :ref:`NCC_EOOM002 `
     - The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit.
     - You may need to reduce batch/tensor size or utilize pipeline/tensor parallelism via neuronx-distributed.
   * - :ref:`NCC_ESFH002 `
     - The compiler encountered an unsigned 64-bit integer constant with a value that cannot be safely converted to 32-bit representation.
     - Try to use uint32 for constants when possible and restructure code to avoid large constants.
   * - :ref:`NCC_ESPP004 `
     - The compiler encountered a data type that is not supported for code generation.
     - Use a supported data type as listed in the Neuron documentation.
   * - :ref:`NCC_ESPP047 `
     - Unsupported 8-bit floating-point data type.
     - The compiler found usage of an unsupported 8-bit floating-point data type. Convert to a supported type like torch.float16.
   * - :ref:`NCC_EUOC002 `
     - An unsupported operator was used.
     - Try using alternative operators from the full list of supported operators via ``neuronx-cc list-operators --framework XLA`` to work around the limitation.
   * - :ref:`NCC_EVRF001 `
     - An unsupported operator was used.
     - Try using alternative operators from the full list of supported operators to work around the limitation.
   * - :ref:`NCC_EVRF004 `
     - Complex data types are not supported on the Neuron device.
     - You cannot use complex data types (such as ``complex64``, ``complex128``, and others) on the Neuron device directly.
   * - :ref:`NCC_EVRF005 `
     - Unsupported F8E4M3FNUZ, F8E4M3B11FNUZ, or F8E5M2FNUZ data type.
     - The compiler found usage of unsupported 8-bit floating-point data types. Convert to a supported type like torch.float16.
   * - :ref:`NCC_EVRF006 `
     - The compiler encountered a RNGBitGenerator operation using a random number generation algorithm other than RNG_DEFAULT.
     - Ensure that you are using standard JAX/PyTorch random APIs and not explicitly specifying an RNG algorithm.
   * - :ref:`NCC_EVRF007 `
     - The number of instructions generated exceeds the limit.
     - Consider applying model parallelism, as partitioning the model will help break large computational graphs into smaller subgraphs.
   * - :ref:`NCC_EVRF009 `
     - The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit.
     - You may need to reduce batch/tensor size or utilize pipeline/tensor parallelism via neuronx-distributed.
   * - :ref:`NCC_EVRF010 `
     - The compiler encountered simultaneous use of input and kernel dilation, which is not supported.
     - If possible, use only input or kernel dilation, not both simultaneously.
   * - :ref:`NCC_EVRF011 `
     - The compiler encountered strided convolution combined with dilated input, which is not supported.
     - If possible, remove stride or input dilation, or apply upsampling and downsampling separately.
   * - :ref:`NCC_EVRF013 `
     - TopK does not support integer input tensors (int32, int64).
     - The TopK operation cannot be performed on integer data types.
   * - :ref:`NCC_EVRF015 `
     - The compiler encountered a custom call instruction with a target name that is not recognized.
     - Use a supported custom call target from the list of recognized targets.
   * - :ref:`NCC_EVRF016 `
     - The scatter-reduce operation cannot perform reduction logic if the data being scattered or the destination tensor is using an integer or boolean data type.
     - Cast your input and source tensors to a floating-point data type (e.g., torch.float32 or torch.bfloat16).
   * - :ref:`NCC_EVRF017 `
     - Reduce-window operation with base dilation greater than 1 is not supported.
     - Change base dilation to be all 1s or consider manual dilation if necessary.
   * - :ref:`NCC_EVRF018 `
     - Reduce-window operation with window dilation greater than 1 is not supported.
     - Remove window_dilation or change values to be all 1s, or consider manual dilation if necessary.
   * - :ref:`NCC_EVRF019 `
     - The compiler encountered a reduce-window operation with more or less than 2 operands.
     - If possible, split a multi-operand reduce_window into multiple single-operand reduce_window operations.
   * - :ref:`NCC_EVRF022 `
     - Shift-right-arithmetic operation on non 32-bit inputs is not supported. Cast the first argument's data type to be S32, U32, or F32.
     - You need to use 32-bit data types for shift operations. Cast inputs to int32, uint32, or float32.
   * - :ref:`NCC_EVRF031 `
     - The compiler encountered a scatter out-of-bounds error.
     - Ensure that the iota size matches the operand dimension size.
   * - :ref:`NCC_EXSP001 `
     - The combined memory needed for the model's activation tensors exceeds the high-bandwidth memory limit.
     - You may need to reduce batch/tensor size or utilize pipeline/tensor parallelism via neuronx-distributed.
   * - :ref:`NCC_EXTP004 `
     - The number of instructions generated exceeds the limit.
     - Consider applying model parallelism, as partitioning the model will help break large computational graphs into smaller subgraphs.

.. toctree::
   :hidden:
   :maxdepth: 1

   EARG001
   EBIR023
   EBVF030
   EHCA005
   EOOM001
   EOOM002
   ESFH002
   ESPP004
   ESPP047
   EUOC002
   EVRF001
   EVRF004
   EVRF005
   EVRF006
   EVRF007
   EVRF009
   EVRF010
   EVRF011
   EVRF013
   EVRF015
   EVRF016
   EVRF017
   EVRF018
   EVRF019
   EVRF022
   EVRF031
   EXSP001
   EXTP004

================================================
FILE: compiler/index.rst
================================================

.. _neuron_cc:

Neuron Graph Compiler
======================

The Neuron Graph Compiler is a sophisticated compilation system that transforms machine learning models from various frameworks (TensorFlow, MXNet, PyTorch, XLA HLO) into highly optimized code for AWS Neuron accelerators. It performs deep analysis of model structure, applies hardware-specific optimizations, and generates executable code tailored for maximum performance on Neuron hardware.
The Neuron compiler is available in two versions to support different AWS ML accelerator architectures:

* **neuronx-cc**: The newer XLA-based compiler supporting NeuronCores v2 architecture (Trn1, Inf2, Trn1n, Trn2). This compiler leverages the XLA (Accelerated Linear Algebra) framework to provide advanced optimizations for modern ML workloads.
* **neuron-cc**: The TVM-based compiler supporting NeuronCores v1 architecture (Inf1). This compiler uses the TVM (Tensor Virtual Machine) framework as its foundation.

Key capabilities of the Neuron Graph Compiler include:

* **Performance optimization**: Intelligently converts FP32 operations to more efficient formats (BF16/FP16/TF32/FP8) with configurable precision-performance tradeoffs. By default, the compiler automatically casts FP32 matrix multiplication operations to BF16 for optimal performance while maintaining accuracy.
* **Model-specific optimizations**: Provides specialized optimizations for different model architectures:

  * **Generic**: Applies general optimizations suitable for all model types
  * **Transformer**: Implements specific optimizations for transformer-based architectures like BERT, GPT, and other attention-based models
  * **U-Net**: Applies specialized memory optimizations for U-Net architectures to prevent performance-impacting data transfers

* **Distributed training support**: Enables efficient large language model (LLM) training through distribution strategies that shard parameters, gradients, and optimizer states across data-parallel workers.
* **Advanced memory management**: Optimizes memory usage for large models through techniques like model sharding across multiple NeuronCores, with configurable logical NeuronCore settings to control sharding degree.
* **Optimization levels**: Provides multiple optimization levels (1-3) to balance compilation time against runtime performance, allowing users to choose the appropriate tradeoff for their workflow.
* **Mixed precision support**: Offers fine-grained control over precision and performance through auto-casting options, supporting multiple numeric formats (FP32, TF32, FP16, BF16, FP8) with different strengths in dynamic range and numeric precision.

The compilation process is typically transparent to users, as the compiler is invoked automatically within ML frameworks through Neuron Framework plugins. Models are analyzed, optimized, and compiled into a NEFF file (Neuron Executable File Format), which is then loaded by the :doc:`Neuron Runtime ` for execution on Neuron devices.

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: Neuron Graph Compiler Component Release Notes
      :link: /release-notes/components/compiler
      :link-type: doc

      Review the Neuron Graph Compiler release notes for all versions of the Neuron SDK.

.. tab-set::

   .. tab-item:: Neuron Graph Compiler (neuronx-cc) for Trn1 & Inf2

      .. grid:: 1
         :gutter: 3

         .. grid-item-card:: CLI Reference Guide
            :link: neuron-compiler-cli-reference-guide
            :link-type: ref

            Neuron Compiler CLI Reference Guide

         .. grid-item-card:: Graph Compiler Developer Guide
            :link: neuronx-cc-training-mixed-precision
            :link-type: ref

            Mixed precision training guide

         .. grid-item-card:: Graph Compiler Error Code Reference
            :link: ncc-errors-home
            :link-type: ref

            Error code reference

         .. grid-item-card:: How to Use Convolution Kernels in UNet Training Models
            :link: implement-convolution-kernels-unet
            :link-type: ref

            Learn how to modify UNet training models to use convolution kernels with the AWS Neuron SDK.

..
grid-item-card:: Graph Compiler FAQ :link: neuronx_compiler_faq :link-type: ref Frequently asked questions .. tab-item:: Neuron Graph Compiler (neuron-cc) for Inf1 .. grid:: 1 :gutter: 3 .. grid-item-card:: Graph Compiler API Reference Guide :link: neuron-compiler-cli-reference :link-type: ref Neuron Compiler CLI Reference .. grid-item-card:: Graph Compiler Developer Guide :link: neuron-cc-training-mixed-precision :link-type: ref Mixed precision training guide .. grid-item-card:: Graph Compiler FAQ :link: neuron_compiler_faq :link-type: ref Frequently asked questions .. toctree:: :maxdepth: 2 :hidden: /compiler/neuronx-cc /compiler/neuron-cc Error Codes
Release Notes

================================================
FILE: compiler/neuron-cc/api-reference-guide.rst
================================================

API Reference Guide
===================

.. toctree::
   :maxdepth: 1

   /compiler/neuron-cc/command-line-reference

================================================
FILE: compiler/neuron-cc/command-line-reference.rst
================================================

.. _neuron-compiler-cli-reference:

Neuron compiler CLI Reference Guide (``neuron-cc``)
===================================================

This document describes the command line interface of the Neuron compiler. This reference is not relevant for applications that run neuron-cc from within a machine learning framework (TensorFlow-Neuron for example) since these options are passed from the framework directly to neuron-cc.

Using neuron-cc on the command line may be desirable for applications that do not use a framework, or customize existing frameworks. It is also possible to supply CLI commands to the framework as options to be passed through to the compiler.

Usage
--------

Optional parameters are shown in square brackets. See the individual framework guides for the correct syntax.

.. _neuron_cli:

.. rubric:: Neuron Compiler CLI

.. program:: neuron-cc

.. option:: neuron-cc [options] [parameters]

Common options for the Neuron CLI:

- :option:`--verbose` (string) default=“WARN”: Valid values:

  - :option:`DEBUG`
  - :option:`INFO`
  - :option:`WARN`
  - :option:`ERROR`

Use :option:`neuron-cc --help` for information on a specific command.

Available Commands:
~~~~~~~~~~~~~~~~~~~

- :option:`compile`
- :option:`list-operators`

.. option:: neuron-cc compile [parameters]

Compile a model for use on the AWS Inferentia Machine Learning Accelerator.

.. code-block::

   neuron-cc compile --framework --io-config [--neuroncore-pipeline-cores ] [--enable-saturate-infinity] [--enable-fast-loading-neuron-binaries] [--enable-fast-context-switch] [--fp32-cast cast-method] [--fast-math cast-method] [--output ]

**Compile Parameters:**

- :option:``: Input containing model specification. The number of arguments required varies between frameworks:

  - **TENSORFLOW**: A local filename or URI of a TensorFlow Frozen GraphDef (.pb); or the name of a local directory containing a TensorFlow SavedModel. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/graph.proto for the associated .proto schema for TensorFlow Frozen GraphDefs. See https://www.tensorflow.org/guide/saved_model for more information on the SavedModel format.
  - **MXNET**: List of local filenames or URIs where the input architecture .json file and parameter .param file are stored. These contain information related to the architecture of your graph and its associated parameters, respectively.

- :option:`--framework` (string): Framework in which the model was trained. Valid values:

  - :option:`TENSORFLOW`
  - :option:`MXNET`
  - :option:`XLA`

- :option:`--neuroncore-pipeline-cores` (int) (default=1): Number of NeuronCores to be used in "NeuronCore Pipeline" mode. This is different from data parallel deployment (same model on multiple NeuronCores). Refer to the Runtime/Framework documentation for data parallel deployment options. Compile for the given number of NeuronCores so as to leverage NeuronCore Pipeline mode.

  .. note:: This is not used to define the number of NeuronCores to be used in a data parallel deployment (i.e., the same model on multiple NeuronCores). That is a runtime/framework configuration choice.
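As a quick illustration of pipeline mode, these compiler flags can also be forwarded from a framework rather than invoked on the command line. The following is a minimal sketch, not part of the CLI itself, assuming a torch-neuron (Inf1) environment where ``torch.neuron.trace`` accepts a ``compiler_args`` list; the model and input are hypothetical placeholders:

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch_neuron  # noqa: F401  -- registers the torch.neuron namespace (assumed installed)

   class TinyNet(nn.Module):
       """Hypothetical stand-in for a real model."""
       def __init__(self):
           super().__init__()
           self.fc = nn.Linear(128, 64)

       def forward(self, x):
           return torch.relu(self.fc(x))

   model = TinyNet().eval()
   example = torch.rand(1, 128)

   # Forward --neuroncore-pipeline-cores to neuron-cc at trace time so the
   # compiled model is partitioned across 4 NeuronCores in pipeline mode.
   model_neuron = torch.neuron.trace(
       model,
       example_inputs=[example],
       compiler_args=['--neuroncore-pipeline-cores', '4'],
   )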
- :option:`--output` (string) (default=“out.neff”): Filename where compilation output (NEFF archive) will be recorded.

- :option:`--io-config` (string): Configuration containing the names and shapes of input and output tensors. The io-config can be specified as a local filename, a URI, or a string containing the io-config itself. The io-config must be formatted as a JSON object with two members, “inputs” and “outputs”. “inputs” is an object mapping input tensor names to an array of shape and data type. “outputs” is an array of output tensor names. Consider the following example:

  .. code-block:: json

     {
       "inputs": {
         "input0:0": [[1,100,100,3], "float16"],
         "input1:0": [[1,100,100,3], "float16"]
       },
       "outputs": ["output:0"]
     }

- :option:`--enable-saturate-infinity`: Convert +/- infinity values to MAX/MIN_FLOAT for certain computations that have a high risk of generating Not-a-Number (NaN) values. There is a potential performance impact during model execution when this conversion is enabled.

- :option:`--enable-fast-loading-neuron-binaries`: Write the compilation output (NEFF archive) in uncompressed format, which results in faster loading of the archive during inference.

- :option:`--enable-fast-context-switch`: Optimize for faster model switching rather than inference latency. This results in overall faster system performance when your application switches between models frequently on the same NeuronCore (or set of cores). For example, the optimization triggered by this option defers loading some weight constants until the start of inference.

- :option:`--fast-math`: Controls the tradeoff between performance and accuracy for fp32 operators. See more suggestions on how to use this option and its arguments in :ref:`neuron-cc-training-mixed-precision`.

  - ``all`` (Default): Enables all optimizations that improve performance. This option can potentially lower precision/accuracy.
  - ``none``: Disables all optimizations that improve performance. This option will provide the best precision/accuracy.
  - Tensor transpose options

    - ``fast-relayout``: Only enables the fast relayout optimization to improve performance by using the matrix multiplier for tensor transpose. The data type used for the transpose is either FP16 or BF16, which is controlled by the ``fp32-cast-xxx`` keyword.
    - ``no-fast-relayout``: Disables the fast relayout optimization, which ensures that tensor transpose is bit-accurate (lossless) but slightly slower.

  - Casting options

    - ``fp32-cast-all`` (Default): Cast all FP32 operators to BF16 to achieve the highest performance and preserve dynamic range. Same as setting ``--fp32-cast all``.
    - ``fp32-cast-all-fp16``: Cast all FP32 operators to FP16 to achieve speed up and increase precision versus BF16. Same as setting ``--fp32-cast all-fp16``.
    - ``fp32-cast-matmult``: Only cast FP32 operators that use the Neuron Matmult engine to BF16 while using FP16 for matmult-based transpose to get better accuracy. Same as setting ``--fp32-cast matmult``.
    - ``fp32-cast-matmult-bf16``: Cast only FP32 operators that use the Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range. Same as setting ``--fp32-cast matmult-bf16``.
    - ``fp32-cast-matmult-fp16``: Cast only FP32 operators that use the Neuron Matmult engine (including matmult-based transpose) to FP16 to better preserve precision. Same as setting ``--fp32-cast matmult-fp16``.
.. important::

   * ``all`` and ``none`` are mutually exclusive
   * ``all`` is equivalent to using ``fp32-cast-all fast-relayout`` (best performance)
   * ``none`` is equivalent to using ``fp32-cast-matmult-bf16 no-fast-relayout`` (best accuracy)
   * ``fp32-cast-*`` options are mutually exclusive
   * ``fast-relayout`` and ``no-fast-relayout`` are mutually exclusive
   * The ``fp32-cast-*`` and ``*-fast-relayout`` options will overwrite the default behavior in ``all`` and ``none``.
   * For backward compatibility, the ``--fp32-cast`` option has higher priority over ``--fast-math``. It will overwrite the FP32 casting options in any of the ``--fast-math`` options if the ``--fp32-cast`` option is explicitly present.

- :option:`--fp32-cast`: Refine the automatic casting of fp32 tensors. It is being replaced by the newer ``--fast-math`` option.

  .. important::

     * The ``--fp32-cast`` option is being deprecated and ``--fast-math`` will replace it in future releases.
     * ``--fast-math`` introduces the ``no-fast-relayout`` option to enable lossless transpose operations.

  ``--fp32-cast`` is an interface for controlling the performance and accuracy tradeoffs. Many of the ``--fast-math`` values invoke (override) it.

  - ``all`` (default): Cast all FP32 operators to BF16 to achieve speed up and preserve dynamic range.
  - ``matmult``: Cast only FP32 operators that use the Neuron Matmult engine to BF16 while using FP16 for matmult-based transpose to get better accuracy.
  - ``matmult-fp16``: Cast only FP32 operators that use the Neuron Matmult engine (including matmult-based transpose) to FP16 to better preserve precision.
  - ``matmult-bf16``: Cast only FP32 operators that use the Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range.
  - ``all-fp16``: Cast all FP32 operators to FP16 to achieve speed up and better preserve precision.

**Log Levels:**

Logs at levels “trace”, “debug”, and “info” will be written to STDOUT. Logs at levels “warn”, “error”, and “fatal” will be written to STDERR.

**Exit Status**

**0** - Compilation succeeded

**>0** - An error occurred during compilation.

**Examples**

Compiling a saved TensorFlow model:

.. code-block:: shell

   neuron-cc compile test_graph_tfmatmul.pb --framework TENSORFLOW --io-config test_graph_tfmatmul.config

Compiling an MXNet model:

.. code-block:: shell

   neuron-cc compile lenet-symbol.json lenet-0001.params --framework MXNET --neuroncore-pipeline-cores 2 --output file.neff

Compiling an XLA HLO:

.. code-block:: shell

   neuron-cc compile bert-model.hlo --framework XLA --output file.neff

.. _neuron-cc-list-operators:

.. option:: neuron-cc list-operators [parameters]

.. _description-1:

Returns a newline ('\n') separated list of operators supported by the NeuronCore.

- **TENSORFLOW**: Operators will be formatted according to the value passed to the associated REGISTER_OP(“OperatorName”) macro. See https://www.tensorflow.org/guide/create_op#define_the_op_interface for more information regarding operator registration in TensorFlow.
- **MXNET**: Operator names will be formatted according to the value passed to the associated NNVM_REGISTER_OP(operator_name) macro.
- **XLA**: Operator names will be formatted according to the value used by the XLA compiler in XlaBuilder. See https://www.tensorflow.org/xla/operation_semantics for more information regarding XLA operator semantics in the XLA interface.

.. code-block:: shell

   neuron-cc list-operators --framework

.. _options-1:

- :option:`--framework` (string): Framework in which the operators were registered.
  Valid values:

  - :option:`TENSORFLOW`
  - :option:`MXNET`
  - :option:`XLA`

**Exit Status**

**0** - Call succeeded

**>0** - An error occurred

**Example**

.. code-block:: shell

   $ neuron-cc list-operators --framework TENSORFLOW
   AddN
   AdjustContrastv2
   CheckNumbers
   ...

================================================
FILE: compiler/neuron-cc/developer-guide.rst
================================================

Developer Guide
===================

.. toctree::
   :maxdepth: 1

   /about-neuron/appnotes/neuron-cc/mixed-precision

================================================
FILE: compiler/neuron-cc/faq.rst
================================================

.. _neuron_compiler_faq:

Neuron Compiler FAQ (``neuron-cc``)
===================================

.. contents:: Table of contents
   :local:
   :depth: 1

Where can I compile to Neuron?
---------------------------------

The one-time compilation step from the standard framework-level model to the NEFF binary may be performed on any EC2 instance or even on-premises. We recommend using a high-performance compute server of choice (C5 or z1d instance types) for the fastest compile times and ease of use with a prebuilt `DLAMI `__. Developers can also install Neuron in their own environments; this approach may work well, for example, when building a large fleet for inference, allowing the model creation, training and compilation to be done in the training fleet, with the NEFF files being distributed by a configuration management application to the inference fleet.

My current neural network is based on FP32, how can I use it with Neuron?
-------------------------------------------------------------------------

Developers who want to train their models in FP32 for best accuracy can compile and deploy them with Neuron. The Neuron compiler automatically converts FP32 to internally supported datatypes, such as FP16 or BF16. You can find more details about FP32 data type support and performance and accuracy tuning in :ref:`neuron-cc-training-mixed-precision`. The Neuron compiler preserves the application interface - FP32 inputs and outputs. Transferring such large tensors may become a bottleneck for your application. Therefore, you can improve execution time by casting the inputs and outputs to FP16 or BF16 in the ML framework prior to compilation for Inferentia.

What are some of the important compiler defaults I should be aware of?
-----------------------------------------------------------------------

The compiler compiles the input graph for a single NeuronCore by default. Using the :option:`--neuroncore-pipeline-cores` option directs the compiler to partition so as to run on a specified number of NeuronCores. This number can be less than the total available NeuronCores on an instance. See :ref:`inferentia-arch` for more information on NeuronCores.

Which operators does Neuron support?
---------------------------------------

See :ref:`neuron-supported-operators`. You can also use the ``neuron-cc list-operators`` command on the CLI to list the operators. See :ref:`neuron-cc-list-operators`.

If your model contains operators missing from the above list, and you can't reach your performance goals, please post a message on the Neuron developer forum or open a GitHub issue to let us know.

Any operators that Neuron doesn't support?
---------------------------------------------

Models with control-flow and dynamic shapes are not supported. You will need to partition the model using the framework prior to compilation. See the :ref:`neuron-cc`. One common partitioning pattern is sketched below.
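The following is a minimal sketch of that pattern, keeping data-dependent control flow in Python on the host and compiling only the static-shape subgraph. It assumes a torch-neuron (Inf1) environment and a hypothetical two-branch classifier; it illustrates the idea rather than a prescribed API flow:

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch_neuron  # noqa: F401  -- registers the torch.neuron namespace (assumed installed)

   class Backbone(nn.Module):
       """Static-shape subgraph: this part is compiled for Neuron."""
       def __init__(self):
           super().__init__()
           self.fc = nn.Linear(128, 2)

       def forward(self, x):
           return self.fc(x)

   backbone = Backbone().eval()
   example = torch.rand(1, 128)
   backbone_neuron = torch.neuron.trace(backbone, example_inputs=[example])

   def predict(x):
       # Data-dependent control flow stays on the host, outside the compiled graph.
       logits = backbone_neuron(x)
       if int(logits.argmax(dim=-1)) == 0:
           return "class-0 path"
       return "class-1 path"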
Will I need to recompile again if I updated runtime/driver version?
----------------------------------------------------------------------

The compiler and runtime are committed to maintaining compatibility for major version releases with each other. The versioning is defined as major.minor, with compatibility for all versions with the same major number. If the versions mismatch, an error notification is logged and the load will fail. This will then require the model to be recompiled.

I have a NEFF binary, how can I tell which compiler version generated it?
--------------------------------------------------------------------------

We will bring a utility out to help with this soon.

How long does it take to compile?
------------------------------------

It depends on the model and its size and complexity, but this generally takes a few minutes.

================================================
FILE: compiler/neuron-cc.rst
================================================

.. _neuron-cc-index:

Neuron Compiler for Inf1
========================

.. toctree::
   :maxdepth: 1

   API Reference Guide
   CLI Reference
   Developer Guide
   FAQ

================================================
FILE: compiler/neuronx-cc/api-reference-guide/index.rst
================================================

.. _neuron-compiler-cli-reference-guide:

Neuron Compiler CLI Reference Guide (``neuronx-cc``)
====================================================

This document describes the command line interface of the Neuron Compiler. This reference is not relevant for applications that run the Neuron Compiler from within a machine learning framework (:ref:`PyTorch-Neuron ` for example) since these options are passed from the framework directly to the compiler.

Using the compiler command line may be desirable for applications that do not use a framework or customize existing frameworks. It is also possible to specify compiler options within the framework, which will forward these options to the compiler using :ref:`NEURON_CC_FLAGS `.

.. contents:: Table of Contents
   :local:
   :depth: 3

Usage
-----

*Optional parameters are shown in square brackets.*

.. _neuron_cli:

.. rubric:: Neuron Compiler Command-Line Interface

.. program:: neuronx-cc

.. option:: neuronx-cc [parameters]

Available Commands
------------------

- ``compile``
- ``list-operators``

Common parameters for the Neuron CLI:

- ``--help``: Display a usage message of compiler options.

Use ``neuronx-cc --help`` for information on a specific command.

.. _neuronx-cc-compile:

'compile' Command
-----------------

.. option:: neuronx-cc compile [parameters]

.. _description-1:

Compile a model for use on the AWS Machine Learning Accelerator.

.. code-block:: shell

   neuronx-cc compile --framework --target [--model-type ] [--auto-cast ] [--auto-cast-type ] [--distribution-strategy ] [--logical-nc-config ], or [-lnc ] [--optlevel ], or [-O ] [--enable-mixed-precision-accumulation] [--enable-saturate-infinity] [--enable-fast-context-switch] [--enable-fast-loading-neuron-binaries] [--logfile ] [--output ] [--verbose ]

Parameters
~~~~~~~~~~

- ````: Input containing model specification. The number of arguments required varies between frameworks:

  - **XLA**: A local filename of an HLO file (hlo.pb) generated via XLA. See `hlo.proto `_ for the .proto description and `inspect-compiled-programs `_ for more information on how to generate such files.

- ``--framework ``: Framework used to generate the training model. Valid values:

  - ``XLA``

- ``--target ``: Name of the Neuron instance family on which the compiled model will be run.
  Valid values:

  - ``inf2``
  - ``trn1``
  - ``trn1n``
  - ``trn2``

- ``--model-type ``: Permit the compiler to attempt model-specific optimizations based upon the type of model being compiled. (Default: ``generic``)

  Valid values:

  - ``generic``: Perform optimizations applicable to all types of inference and training models.
  - ``transformer``: Perform optimizations specific to `Transformer `_ models.
  - ``unet-inference``: Perform optimizations specific to certain `U-Net `_ model architectures when performing inference. U-Net models often have certain structures that result in excessive performance-impacting data transfers; this option allows the compiler to apply additional memory optimizations to prevent these data transfers and also allows the compiler to map larger normalization operators which would otherwise not successfully execute.

- ``--auto-cast ``: Controls how the compiler makes tradeoffs between performance and accuracy for FP32 operations. (Default: ``none``)

  Valid values:

  - ``none``: (default) Leave all data types as defined in the model. Do not apply auto-casting data type optimizations.
  - ``matmult``: Only cast FP32 operations that use the Neuron matrix-multiplication engine.
  - ``all``: Cast all FP32 operations to achieve the highest performance. This option can potentially lower precision/accuracy.

  A more complete discussion on how to use this option and its arguments is in :ref:`Mixed Precision and Performance-accuracy Tuning for Training `.

  .. note:: If the ``--auto-cast`` option is specified, the ``--auto-cast-type`` compiler flag can be optionally set to define which lower-precision data type the compiler should use.

- ``--auto-cast-type ``: When auto-cast mode is enabled, cast the FP32 operators to the lower-precision data type specified by this option. (Default: ``bf16``)

  Valid values:

  - ``bf16``: Cast the FP32 operations selected via the ``--auto-cast`` option to BF16 to achieve the highest performance and preserve dynamic range.
  - ``fp16``: Cast the FP32 operations selected via the ``--auto-cast`` option to FP16 to achieve improved performance relative to FP32 and increased precision relative to BF16.
  - ``tf32``: Cast the FP32 operations selected via the ``--auto-cast`` option to TensorFloat-32.
  - ``fp8_e4m3``: Cast the FP32 operations selected via the ``--auto-cast`` option to a signed 8-bit floating point represented as a 4-bit exponent and 3-bit mantissa.

  .. note:: If multiple competing options are specified then the option right-most on the command line will supersede previous options.

- ``--distribution-strategy ``: Permit the compiler to attempt optimizations specific to the distribution strategy used to train the model.

  Valid values:

  - ``llm-training``: Enable the compiler to perform optimizations applicable to large language model (LLM) training runs that shard parameters, gradients, and optimizer states across data-parallel workers. This is equivalent to the previously documented option argument value of ``NEMO``, which will be deprecated in a future release.

- ``--logical-nc-config ``: Instructs the compiler to shard the input graph across physical NeuronCore accelerators. Possible numeric values are {1, 2}. (Only available on trn2; Default: ``2``)

  Valid values:

  - ``1``: instructs the compiler to shard the input graph across 1 physical NeuronCore, i.e., do not perform any input graph sharding.
  - ``2``: [default on trn2] instructs the compiler to shard the input graph across 2 physical NeuronCores.
- ``--optlevel ``: Specify the level of optimization the compiler should perform. Possible numeric values are {1, 2, 3}. (Default: ``2``)

  Valid values:

  - ``1``: enables the core performance optimizations in the compiler, while also minimizing compile time.
  - ``2``: [default] provides the best balance between model performance and compile time.
  - ``3``: may provide additional model execution performance but may incur longer compile times and higher host memory usage during model compilation.

  .. note:: This option supersedes, and deprecates, the ``--enable-experimental-O1`` option introduced in an earlier release.

- ``--enable-mixed-precision-accumulation``: Enabled (set to ``true``) by default. Perform intermediate calculations of accumulation operators (such as softmax and layernorm) in FP32 and cast the result to the model-designated datatype. This improves the operator's resulting accuracy.

- ``--disable-mixed-precision-accumulation``: Disables mixed precision accumulation, which is enabled by default. Disabling it may improve performance at the cost of reduced accuracy for certain operators.

- ``--enable-saturate-infinity``: Convert +/- infinity values to MAX/MIN_FLOAT for compiler-introduced matrix-multiply transpose computations that have a high risk of generating Not-a-Number (NaN) values. There is a potential performance impact during model execution when this conversion is enabled. (Only needed on trn1; while the trn2 compiler will accept this flag for compatibility reasons, it has no effect on the compilation.)

- ``--enable-fast-context-switch``: Optimize for faster model switching rather than execution latency. This option will defer loading some weight constants until the start of model execution. This results in overall faster system performance when your application switches between models frequently on the same NeuronCore (or set of cores).

- ``--enable-fast-loading-neuron-binaries``: Save the compilation output file in an uncompressed format. This creates executable files which are larger in size but faster for the Neuron Runtime to load into memory during model execution.

- ``--logfile ``: Filename where the compiler writes log messages. (Default: “log-neuron-cc.txt”)

- ``--output ``: Filename where compilation output (NEFF archive) will be recorded. (Default: “file.neff”)

- ``--verbose ``: Specify the level of output produced by the compiler. (Default: ``warning``)

  Valid values:

  - ``info``: Informational messages regarding the progress of model compilation (written to stdout).
  - ``warning``: Diagnostic messages that report model code that is not inherently erroneous but may be risky or suggest there may have been an error (written to stderr).
  - ``error``: The compiler detected a condition causing it to not complete the compilation successfully (written to stderr).
  - ``critical``: The compiler encountered an unrecoverable error and terminates immediately (written to stderr).
  - ``debug``: Extensive information regarding the compiler's internal execution phases (written to stdout).

*Example*:

Compiling an XLA HLO:

.. code-block:: shell

   neuronx-cc compile bert-model.hlo --framework XLA --target trn1 --model-type transformer --output bert.neff

.. _neuronx-cc-list-operators:

'list-operators' Command
------------------------

.. option:: neuronx-cc list-operators [parameters]

.. _description-1:

Returns a newline ('\n') separated list of operators supported by the Neuron Compiler.
.. code-block:: shell

   neuronx-cc list-operators --framework

Parameters
~~~~~~~~~~

- ``--framework ``: Framework in which the operators were registered.

  Valid values:

  - ``XLA``: Operator names will be formatted according to the value used by the XLA compiler in XlaBuilder.

*Example*:

.. code-block:: shell

   neuronx-cc list-operators --framework XLA
   ...

Compiler Exit Statuses
----------------------

- **0**: Compilation succeeded
- **>0**: An error occurred during compilation.

================================================
FILE: compiler/neuronx-cc/developer-guide.rst
================================================

.. meta::
   :description: Developer guides for the Neuron Compiler (neuronx-cc), including mixed precision training, performance tuning, and custom kernel implementation for AWS Trainium and Inferentia.
   :keywords: neuronx-cc, Neuron Compiler, mixed precision, BF16, FP16, TF32, auto-cast, convolution kernels, UNet, performance optimization, Trainium, Inferentia

Developer Guide
===================

Learn how to optimize your models with the Neuron Compiler (neuronx-cc). These guides cover mixed precision training, performance-accuracy tuning, and custom kernel implementations for AWS Trainium and Inferentia instances.

.. grid:: 1 1 2 2
   :gutter: 3

   .. grid-item-card:: Mixed Precision and Performance-Accuracy Tuning
      :link: /about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision
      :link-type: doc

      Learn how to use FP32, TF32, FP16, and BF16 data types with the Neuron Compiler's auto-cast options to balance performance and accuracy. Understand the tradeoffs between different data types and how to configure compiler settings for optimal model execution.

   .. grid-item-card:: How to Use Convolution Kernels in UNet Training Models
      :link: /compiler/neuronx-cc/how-to-convolution-in-unet
      :link-type: doc

      Modify UNet training models to use custom convolution kernels with NKI (Neuron Kernel Interface). This implementation helps avoid out-of-memory errors when training convolution-heavy models on Trainium instances.

.. toctree::
   :hidden:
   :maxdepth: 1

   /about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision
   /compiler/neuronx-cc/how-to-convolution-in-unet

================================================
FILE: compiler/neuronx-cc/faq.rst
================================================

.. _neuronx_compiler_faq:

Neuron Compiler FAQ (``neuronx-cc``)
====================================

.. contents:: Table of contents
   :local:
   :depth: 1

Where can I compile to Neuron?
---------------------------------

The one-time compilation step from the standard framework-level model to the NEFF binary may be performed on any EC2 instance or even on-premises. We recommend using a high-performance compute server of choice (C5 or z1d instance types) for the fastest compile times and ease of use with a prebuilt `DLAMI `__. Developers can also install Neuron in their own environments; this approach may work well, for example, when building a large fleet for inference, allowing the model creation, training and compilation to be done in the training fleet, with the NEFF files being distributed by a configuration management application to the inference fleet.

.. _neuron-vs-neuronx:

What is the difference between ``neuron-cc`` and ``neuronx-cc``?
----------------------------------------------------------------

* ``neuron-cc`` is the Neuron Compiler with a TVM front-end; it supports only :ref:`neuroncores-v1-arch`.
* ``neuronx-cc`` is the Neuron Compiler with an XLA front-end; it currently supports :ref:`neuroncores-v2-arch`. ``neuronx-cc`` support of :ref:`neuroncores-v1-arch` is currently a :ref:`Roadmap Item `.

Should I use ``neuron-cc`` or ``neuronx-cc``?
---------------------------------------------

See :ref:`neuron-vs-neuronx`.

My current neural network is based on FP32, how can I use it with Neuron?
-------------------------------------------------------------------------

Developers who want to train their models in FP32 for best accuracy can compile and deploy them with Neuron. The Neuron compiler automatically converts FP32 to internally supported datatypes, such as FP16 or BF16. You can find more details about FP32 data type support and performance and accuracy tuning in :ref:`neuronx-cc-training-mixed-precision` or :ref:`neuron-cc-training-mixed-precision`. The Neuron compiler preserves the application interface - FP32 inputs and outputs. Transferring such large tensors may become a bottleneck for your application. Therefore, you can improve execution time by casting the inputs and outputs to FP16 or BF16 in the ML framework prior to compilation.

Which operators does Neuron support?
---------------------------------------

You can use the ``neuronx-cc list-operators`` command on the CLI to list the operators. See :ref:`neuron-compiler-cli-reference-guide`.

To request support for new operators, open an issue on our `GitHub forum `_.

Any operators that Neuron Compiler doesn't support?
---------------------------------------------------

Models with control-flow and dynamic shapes are currently not supported. You will need to partition the model using the framework prior to compilation.

.. note:: Starting with :ref:`neuroncores-v2-arch` Neuron supports control-flow and dynamic shapes. Stay tuned and follow the :ref:`Neuron Roadmap `.

Will I need to recompile again if I updated runtime/driver version?
----------------------------------------------------------------------

The compiler and runtime are committed to maintaining compatibility for major version releases with each other. The versioning is defined as major.minor, with compatibility for all versions with the same major number. If the versions mismatch, an error notification is logged and the load will fail. This will then require the model to be recompiled.

I have a NEFF binary, how can I tell which compiler version generated it?
-------------------------------------------------------------------------

We will bring a utility out to help with this soon.

How long does it take to compile?
------------------------------------

It depends on the model and its size and complexity, but this generally takes a few minutes.

Why is my model producing different results compared to CPU/GPU?
----------------------------------------------------------------

:ref:`neuroncores-v2-arch` supports multiple casting modes for floating point numbers, each with associated implications for performance and accuracy. The default casting mode is a pragmatic balance between performance and accuracy; however, on some models it may result in a loss of precision. See the :option:`--auto-cast` and :option:`--auto-cast-type` options in :ref:`neuron-compiler-cli-reference-guide` for details on how to adjust the casting mode.

Do you support model **?
-------------------------------------------

``neuronx-cc`` has explicit support for select model families using the :option:`--model-type` option, though many other model types are supported. (A sketch of passing this flag from a framework appears below.)
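For instance, when compiling through a framework instead of invoking ``neuronx-cc`` directly, flags such as :option:`--model-type` are forwarded via the ``NEURON_CC_FLAGS`` environment variable. A minimal sketch, assuming a torch-neuronx (Trn1/Inf2) environment and a hypothetical transformer-style module:

.. code-block:: python

   import os
   import torch
   import torch.nn as nn
   import torch_neuronx

   # Forward compiler flags through the framework plugin (set before tracing).
   os.environ["NEURON_CC_FLAGS"] = "--model-type transformer"

   class TinyAttentionBlock(nn.Module):
       """Hypothetical stand-in for a transformer layer."""
       def __init__(self):
           super().__init__()
           self.attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

       def forward(self, x):
           out, _ = self.attn(x, x, x)
           return out

   model = TinyAttentionBlock().eval()
   example = torch.rand(1, 16, 64)

   # torch_neuronx.trace invokes neuronx-cc under the hood, picking up NEURON_CC_FLAGS.
   model_neuron = torch_neuronx.trace(model, example)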
You can also inspect supported operators using the :option:`list-operators` sub-command. See the :ref:`neuron-compiler-cli-reference-guide` for details.

More generally, support for new operators and models is continually being added. See our :ref:`neuron_roadmap` for details.

================================================
FILE: compiler/neuronx-cc/how-to-convolution-in-unet.rst
================================================

.. meta::
   :description: Learn how to modify UNet training models to use convolution kernels with AWS Neuron SDK
   :date_updated: 2025-09-09

.. _implement-convolution-kernels-unet:

=======================================================
How to Use Convolution Kernels in UNet Training Models
=======================================================

Task overview
-------------

This topic discusses how to modify UNet training models to use convolution kernels with the AWS Neuron SDK. This implementation helps avoid out-of-memory errors seen when performing training on the convolution-heavy UNet model.

Prerequisites
-------------

- AWS Neuron SDK 2.26 or later: Required for kernel implementation support
- trn1.32xlarge instance: Needed for model training
- Existing UNet implementation: Base model to be modified
- PyTorch-Neuron environment: Required for neural network operations

Instructions
------------

**1: Import required dependencies**

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F
   from torch.autograd import Function
   import neuronxcc.nki as nki
   import neuronxcc.nki.language as nl
   from neuronxcc.nki._private_kernels.conv import conv2d_dw_fb01_io01_01bf_rep_nhwc_Pcinh

**2: Create the convolution wrapper function**

.. code-block:: python

   @nki.jit
   def conv_wrap(img_ref, filter_ref, out_shape):
       out_arr = nl.ndarray(shape=out_shape, dtype=img_ref.dtype, buffer=nl.hbm)
       conv2d_dw_fb01_io01_01bf_rep_nhwc_Pcinh(img_ref, filter_ref, out_arr, **{
           'input': img_ref.shape,
           'filter': filter_ref.shape,
           'output': out_shape,
           'in_perm': [0, 1, 2, 3],
           'kern_perm': [0, 1, 2, 3],
           'out_perm': [0, 1, 2, 3],
           'stride': (1, 1),
           'padding': ((1, 1), (1, 1))})
       return out_arr

**3: Implement the custom Conv2d module**

.. code-block:: python

   class BwdConv2dWithKernel(nn.Module):
       def __init__(self, in_channels, out_channels, kernel_size, padding, bias):
           super().__init__()
           assert padding == 1
           assert bias == False
           self.in_channels = in_channels
           self.out_channels = out_channels
           self.kernel_size = kernel_size
           self.weight = nn.Parameter(torch.randn(out_channels, in_channels, kernel_size, kernel_size))
           nn.init.kaiming_uniform_(self.weight, a=0.0, mode='fan_in', nonlinearity='leaky_relu')

**4: Replace standard convolutions in the UNet model**

.. code-block:: python

   class DoubleConvWithKernel(nn.Module):
       def __init__(self, in_channels, out_channels, mid_channels=None):
           super().__init__()
           if not mid_channels:
               mid_channels = out_channels
           self.double_conv = nn.Sequential(
               BwdConv2dWithKernel(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
               nn.BatchNorm2d(mid_channels),
               nn.ReLU(inplace=True),
               BwdConv2dWithKernel(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
               nn.BatchNorm2d(out_channels),
               nn.ReLU(inplace=True)
           )

**5: Update the UNet model initialization**

.. code-block:: python

   def __init__(self, n_channels, n_classes, bilinear=False):
       super().__init__()
       self.n_channels = n_channels
       self.n_classes = n_classes
       self.bilinear = bilinear
       self.inc = (DoubleConvWithKernel(n_channels, 64))
       # ... rest of initialization
Confirm your work
-----------------

To confirm successful implementation, verify the following:

.. code-block:: bash

   # Expected training output
   Training Device=xla:0 Epoch=1 Step=20 Loss=0.30803
   Training Device=xla:0 Epoch=2 Step=560 Loss=0.01826

Check for:

- No out-of-memory errors during execution
- Decreasing loss values across epochs

Common issues
-------------

.. rubric:: Memory Errors

- Solution: Verify all standard convolutions are replaced with BwdConv2dWithKernel implementations

.. rubric:: Compilation Errors

- Solution: Confirm Neuron SDK version is 2.26 or later

.. rubric:: Kernel Errors

- Solution: Use the kernel for supported configurations. The kernel will error out in unsupported scenarios.

Related information
-------------------

- `UNet training sample `_ - Sample UNet training implementation

================================================
FILE: compiler/neuronx-cc.rst
================================================

.. _neuronx-cc-index:

NeuronX Compiler for Trn1 & Inf2
=================================

.. toctree::
   :maxdepth: 1

   API Reference Guide
   How-to: Convolution
   Developer Guide
   FAQ

================================================
FILE: conf.py
================================================

# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

import datetime
import os
import sys

sys.path.append(os.path.abspath("./_ext"))
sys.path.append(os.path.abspath("./nki/api"))
sys.path.append(os.path.abspath("./nki/_ext"))
sys.path.append(os.path.abspath("./frameworks/torch/torch-neuron/"))
sys.path.append(os.path.abspath("./_static"))

# get environment variables
def get_env_vars_from_gh():
    project_name = os.environ.get("GIT_PROJECT_NAME", "aws-neuron-sdk")
    branch_name = os.environ.get("GIT_BRANCH_NAME", "master")
    branch_name = "master" if branch_name == "latest" else branch_name
    return project_name, branch_name

def get_env_vars_from_rtd():
    branch_name = os.environ.get("READTHEDOCS_VERSION_NAME", "master")
    branch_name = "master" if branch_name == "latest" else branch_name
    project_name = "aws-neuron-sdk"
    if os.environ.get("READTHEDOCS_PROJECT") == "awsdocs-neuron-staging":
        project_name = "private-aws-neuron-sdk-staging"
    return project_name, branch_name

def get_env_vars():
    """Configure project and branch names based on environment"""
    if os.environ.get("READTHEDOCS") == "True":
        return get_env_vars_from_rtd()
    return get_env_vars_from_gh()

project_name, branch_name = get_env_vars()

# -- Project information -----------------------------------------------------

project = "AWS Neuron"
copyright = "{}, Amazon.com".format(datetime.datetime.now().year)
author = "AWS"
master_doc = "index"
html_title = "AWS Neuron Documentation"

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [ "sphinxcontrib.contentui", "nbsphinx", "sphinx.ext.extlinks", "sphinx.ext.intersphinx", "sphinx_plotly_directive", "df_tables", "sphinxcontrib.programoutput", "neuron_tag", "sphinx_design", "ablog", "sphinx.ext.viewcode", "sphinx.ext.napoleon", "sphinx.ext.autodoc", "sphinx.ext.autosummary", "local_documenter", "archive", "sphinx_copybutton", "nki_directives", "sphinxcontrib.googleanalytics", "sphinxcontrib.datatemplates", "sphinxcontrib.spelling", "sphinx_tabs.tabs", ] html_sidebars = { "**": [ "navbar-logo.html", "search-field.html", "sbt-sidebar-nav.html", ], "about-neuron/announcements/*": [ "navbar-logo.html", "search-field.html", "ablog/postcard.html", "ablog/recentposts.html", "ablog/tagcloud.html", "ablog/categories.html", "ablog/archives.html", "sbt-sidebar-nav.html", ], } # Add any paths that contain templates here, relative to this directory. templates_path = [ "_templates", "nki/_templates/", "_content-types/", "libraries/nxd-inference/_templates", ] # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. exclude_patterns = ['_build', '_backup-rn', '_backup-setup', '_content-types','**.ipynb_checkpoints','.venv','_utilities', 'nki/_templates'] html_extra_path = ['static'] # remove bash/python/ipython/jupyter prompts and continuations copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: " copybutton_prompt_is_regexp = True # nbsphinx_allow_errors = True nbsphinx_execute = "never" html_logo = "images/Site-Merch_Neuron-ML-SDK_Editorial.png" napoleon_google_docstring = True # Turn on figure/table numbering numfig = True # -- autodoc/autosummary options ------------------------------------------------- autosummary_generate = True # Turn on sphinx.ext.autosummary # -- more options ------------------------------------------------- projectblob = project_name + "/blob/" + branch_name projecttree = project_name + "/tree/" + branch_name extlinks = { "mxnet-neuron": ( "https://github.com/aws-neuron/" + projectblob + "/neuron-guide/neuron-frameworks/mxnet-neuron/%s", "", ), "pytorch-neuron": ( "https://github.com/aws-neuron/" + projectblob + "/neuron-guide/neuron-frameworks/pytorch-neuron/%s", "", ), "tensorflow-neuron": ( "https://github.com/aws-neuron/" + projectblob + "/neuron-guide/neuron-frameworks/tensorflow-neuron/%s", "", ), "neuron-deploy": ( "https://github.com/aws-neuron/" + projectblob + "/neuron-deploy/%s", "", ), "neuron-tools-tree": ( "https://github.com/aws-neuron/" + projecttree + "/neuron-guide/neuron-tools/%s", "", ), "mxnet-neuron-src": ( "https://github.com/aws-neuron/" + projectblob + "/src/examples/mxnet/%s", "", ), "pytorch-neuron-src": ( "https://github.com/aws-neuron/" + projectblob + "/src/examples/pytorch/%s", "", ), "tensorflow-neuron-src": ( "https://github.com/aws-neuron/" + projectblob + "/src/examples/tensorflow/%s", "", ), "neuron-gatherinfor-src": ( "https://github.com/aws-neuron/" + projectblob + "/src/examples/neuron-gatherinfo/%s", "", ), "neuron-monitor-src": ( "https://github.com/aws-neuron/" + projectblob + "/src/examples/neuron-monitor/%s", "", ), "compile-pt": ( "https://github.com/aws-neuron/" + projectblob + "/archive/src/benchmark/pytorch/%s_compile.py", "", ), "benchmark-pt": ( "https://github.com/aws-neuron/" + projectblob + "/archive/src/benchmark/pytorch/%s_benchmark.py", "", ), "llama-sample": ( 
"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/%s.ipynb", "", ), 'github':(f'https://github.com/aws-neuron/{project_name}/blob/{branch_name}/%s', '') } intersphinx_mapping = { "python": ("https://docs.python.org/3", None), "numpy": ("https://numpy.org/doc/stable/", None), "torch": ("https://pytorch.org/docs/master/", None), "transformers": ("https://huggingface.co/docs/transformers/master/en/", None), } # -- Options for Theme ------------------------------------------------- top_banner_message = "Neuron 2.29.0 is released! Check the What's New and Release Notes for more details." html_theme = "sphinx_book_theme" html_theme_options = { "repository_url": "https://github.com/aws-neuron/" + project_name, "use_issues_button": True, "use_repository_button": True, "use_download_button": True, "use_fullscreen_button": True, "use_edit_page_button": True, "home_page_in_toc": False, "repository_branch": branch_name, "announcement": top_banner_message, # "navbar_persistent": [], } html_additional_pages = { "search-google": "search-google.html", } html_context = { # ... "default_mode": "light" } # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # # html_theme = 'sphinx_rtd_theme' # html_theme_options = { # # 'navigation_depth': 3 # } # html_theme = "pydata_sphinx_theme" # html_theme_options = { # "use_edit_page_button": True, # } # html_context = { # "github_url": "https://github.com", # "github_user": "aws-neuron", # "github_repo": "private-aws-neuron-sdk-staging", # "github_version": "master", # "doc_path": "/", # } # -- Options for HTML output ------------------------------------------------- html_css_files = ["css/custom.css", "styles/sphinx-book-theme.css"] # def setup(app): # app.add_css_file('css/custom.css') # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ["_static"] plotly_include_source = False plotly_html_show_source_link = False plotly_html_show_formats = False plotly_include_directive_source = False # -- ABlog config ------------------------------------------------- blog_path = "about-neuron/announcements/index" blog_post_pattern = "about-neuron/appnotes/*.rst" blog_feed_length = 5 fontawesome_included = True post_show_prev_next = False post_auto_image = 1 post_auto_excerpt = 2 execution_show_tb = "READTHEDOCS" in os.environ # --- Google Analytics Sphinx extension --- googleanalytics_id = "G-2Q13EGB80H" # --- for neuron-tag directive --- rst_prolog = """ .. neuron-tag:: """ rst_epilog = """ .. neuron-tag:: """ # Exclude private github from linkcheck. Readthedocs only exposes the ssh-agent to the 'checkout' build step, which is too early for the linkchecker to run. 
linkcheck_ignore = [
    r"http://localhost:\d+/",
    r"https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/dlami-pytorch-introduce.html",
    r"https://github\.com/aws-neuron/private-aws-neuron-sdk-staging/",
    r"https://awsdocs-neuron-staging.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-neuronx-install.html#install-tensorflow-neuronx",
    r"https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx#inference",
    r"https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx#training",
    r"https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers",
    r"https://github.com/aws-neuron/aws-neuron-sagemaker-samples/tree/master/inference/inf2-bert-on-sagemaker",
    r"https://github.com/awslabs/multi-model-server/blob/master/docs/management_api.md",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/dp_bert_hf_pretrain/run_dp_bert_large_hf_pretrain_bf16_s128.sh",
    r"https://github.com/pytorch/xla/blob/v1.10.0/TROUBLESHOOTING.md",
    r"https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/saved_model.md",
    r"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/g3doc/index.md",
    r"https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb",
    r"https://github.com/aws-neuron/aws-neuron-sdk/blob/master/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.ipynb",
    r"https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md",
    r"https://github.com/pytorch/PiPPy/blob/main/pippy/IR.py#L697",
    r"https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L241",
    r"https://github.com/pytorch/xla/blob/master/torch_xla/utils/checkpoint.py#L129",
    r"https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/parallel_layers/layer_norm.py#L32",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py#L273C1-L289C55",
    r"https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html#pytorch-neuronx-install",
    r"https://github.com/google-research/bert#user-content-pre-trained-models",
    r"https://github.com/google-research/bert#user-content-sentence-and-sentence-pair-classification-tasks",
    r"https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html",
    r"https://repost.aws/knowledge-center/eventbridge-notification-scheduled-events",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py",
    r"https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb",
]

linkcheck_exclude_documents = [
    r"src/examples/.*",
"about-neuron/announcements/neuron1.x/announcements", r"release-notes/.*", r"containers/.*", ] nitpicky = False ================================================ FILE: containers/container-deployment-flows.rst ================================================ .. _container-deployment-flows: Container Deployment Flows ========================== You can also choose one of the following combinations for running the neuron container: .. toctree:: :maxdepth: 1 dlc-then-ec2-devflow dlc-then-ecs-devflow dlc-then-eks-devflow container-sm-hosting-devflow ================================================ FILE: containers/container-sm-hosting-devflow.rst ================================================ .. _containers-byoc-hosting-devflow: .. include:: /devflows/inference/byoc-hosting-devflow.rst ================================================ FILE: containers/developerflows.rst ================================================ Containers - Developer Flows ============================ .. toctree:: :maxdepth: 1 :hidden: /containers/dlc-then-ec2-devflow /containers/dlc-then-ecs-devflow /containers/dlc-then-eks-devflow /containers/container-sm-hosting-devflow /containers/dlc-then-customize-devflow .. include:: /containers/developerflows.txt ================================================ FILE: containers/developerflows.txt ================================================ .. tab-set:: .. tab-item:: Inference * :ref:`containers-dlc-then-ec2-devflow` * :ref:`containers-dlc-then-ecs-devflow` * :ref:`containers-dlc-then-eks-devflow` * :ref:`containers-byoc-hosting-devflow` * :ref:`containers-dlc-then-customize-devflow` ================================================ FILE: containers/dlc-then-customize-devflow.rst ================================================ .. _containers-dlc-then-customize-devflow: .. include:: /devflows/dlc-then-customize-devflow.rst ================================================ FILE: containers/dlc-then-ec2-devflow.rst ================================================ .. _containers-dlc-then-ec2-devflow: .. include:: /devflows/inference/dlc-then-ec2-devflow.rst ================================================ FILE: containers/dlc-then-ecs-devflow.rst ================================================ .. _containers-dlc-then-ecs-devflow: .. include:: /devflows/inference/dlc-then-ecs-devflow.rst ================================================ FILE: containers/dlc-then-eks-devflow.rst ================================================ .. _containers-dlc-then-eks-devflow: .. include:: /devflows/inference/dlc-then-eks-devflow.rst ================================================ FILE: containers/dlc-then-k8s-devflow.rst ================================================ .. _containers-dlc-then-k8s-devflow: .. 
include:: /devflows/inference/dlc-then-k8s-devflow.rst

================================================
FILE: containers/docker-example/Dockerfile.device-plugin
================================================
FROM amazonlinux:2

RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install -y aws-neuron-k8-plugin
RUN yum install -y tar gzip

ENV PATH="/opt/aws/neuron/bin/k8s-neuron-device-plugin:${PATH}"

CMD k8s-neuron-device-plugin

================================================
FILE: containers/docker-example/index.rst
================================================
Example: Run a containerized Neuron application
===============================================

Introduction
------------

This example shows how to run a Neuron application using Docker containers.

Prerequisites
-------------

- Ensure the steps from the guide on :ref:`tensorflow-serving` were completed successfully before continuing.

Steps
-----

Step 1: Start the neuron-rtd container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You may use the prebuilt neuron-rtd image
[790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:latest], or build your
own image as shown in :ref:`neuron-runtime-dockerfile`.

Run the neuron-rtd container as shown below. A volume must be mounted at
``/sock`` inside the container, where neuron-rtd opens a UDS socket; the
application interacts with the runtime through this socket.

.. code:: bash

   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 790709498068.dkr.ecr.us-east-1.amazonaws.com
   docker pull 790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:1.1.1402.0
   docker tag 790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:1.1.1402.0 neuron-rtd

   mkdir /tmp/neuron_rtd_sock
   chmod o+rwx /tmp/neuron_rtd_sock
   docker run --device=/dev/neuron0 --cap-add IPC_LOCK -v /tmp/neuron_rtd_sock/:/sock -it neuron-rtd

If using an older version of neuron-rtd (below 1.1):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   docker pull 790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:1.0.9592.0
   docker tag 790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:1.0.9592.0 neuron-rtd

   mkdir /tmp/neuron_rtd_sock
   chmod o+rwx /tmp/neuron_rtd_sock
   docker run --env AWS_NEURON_VISIBLE_DEVICES="0" --cap-add SYS_ADMIN --cap-add IPC_LOCK -v /tmp/neuron_rtd_sock/:/sock -it neuron-rtd

Step 2: Start the application (TensorFlow Serving) container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Build the tensorflow-model-server-neuron image using the provided example
Dockerfile: :ref:`tensorflow-model-server-neuron-dockerfile`.

Run the container, assuming a compiled saved model is stored in s3:///my_model/:

.. code:: bash

   # Note: the neuron-rtd socket directory must be mounted and pointed at using an environment variable.
   # TensorFlow Serving will use that socket to talk to neuron-rtd.
   docker run --env NEURON_RTD_ADDRESS=unix:/sock/neuron.sock \
       -v /tmp/neuron_rtd_sock/:/sock \
       -p 8501:8501 \
       -p 8500:8500 \
       --env MODEL_BASE_PATH=s3:///my_model/ \
       --env MODEL_NAME=my_model tensorflow-model-server-neuron

Step 3: Verify by running an inference!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ As shown in :ref:`tensorflow-serving` ================================================ FILE: containers/docker-example/inference/Dockerfile-inference ================================================ # Example pytorch neuron container # To build: # docker build . -f Dockerfile.pt -t neuron-container:pytorch # To run on EC2 Inf1 instances with AWS DLAMI: # docker run -it --device=/dev/neuron0 neuron-container:pytorch FROM ubuntu:24.04 LABEL maintainer=" " RUN apt-get update -y \ && apt-get install -y --no-install-recommends \ gnupg2 \ wget \ python3-pip \ python3-setuptools \ && cd /usr/local/bin \ && pip3 --no-cache-dir install --upgrade pip \ && rm -rf /var/lib/apt/lists/* \ && apt-get clean RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - # Installing Neuron Tools RUN apt-get update -y && apt-get install -y \ aws-neuronx-tools # Sets up Path for Neuron tools ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}" # Include framework tensorflow-neuron or torch-neuronx and compiler (compiler not needed for inference) RUN pip3 install \ torch-neuronx \ --extra-index-url=https://pip.repos.neuron.amazonaws.com # Include your APP dependencies here. # RUN ... # Define the entrypoint script that has some application code (if needed) and executes the docker run command # For example you can use something like below # COPY dockerd-libmode-entrypoint.sh /opt/bin/dockerd-entrypoint.sh # RUN chmod +x /opt/bin/dockerd-entrypoint.sh # ENTRYPOINT ["/opt/bin/dockerd-entrypoint.sh"] CMD ["neuron-top"] ================================================ FILE: containers/docker-example/inference/Dockerfile-inference-dlc ================================================ FROM ubuntu:24.04 #SDK 1.17.1 has version 1. We skipped 1.18.0. 
LABEL dlc_major_version="2" LABEL maintainer="Amazon AI" LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true ARG PYTHON=python3.7 ARG PYTHON_VERSION=3.7.10 ARG TS_VERSION=0.5.2 ARG MAMBA_VERSION=4.12.0-0 # See http://bugs.python.org/issue19846 ENV LANG C.UTF-8 ENV LD_LIBRARY_PATH /lib/x86_64-linux-gnu:/opt/conda/lib/:$LD_LIBRARY_PATH ENV PATH /opt/conda/bin:$PATH ENV SAGEMAKER_SERVING_MODULE sagemaker_pytorch_serving_container.serving:main ENV TEMP=/home/model-server/tmp RUN apt-get update \ && apt-get install -y --no-install-recommends software-properties-common \ && add-apt-repository ppa:openjdk-r/ppa \ && apt-get update \ && apt-get install -y --no-install-recommends \ build-essential \ apt-transport-https \ ca-certificates \ cmake \ curl \ emacs \ git \ jq \ libgl1-mesa-glx \ libglib2.0-0 \ libsm6 \ libxext6 \ libxrender-dev \ openjdk-11-jdk \ vim \ wget \ unzip \ zlib1g-dev \ libcap-dev \ gpg-agent \ && rm -rf /var/lib/apt/lists/* \ && rm -rf /tmp/tmp* \ && apt-get clean RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - RUN apt-get update \ && apt-get install -y \ aws-neuron-tools \ && rm -rf /var/lib/apt/lists/* \ && rm -rf /tmp/tmp* \ && apt-get clean # https://github.com/docker-library/openjdk/issues/261 https://github.com/docker-library/openjdk/pull/263/files RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \ mv /etc/ssl/certs/java/cacerts.jks /etc/ssl/certs/java/cacerts; \ /var/lib/dpkg/info/ca-certificates-java.postinst configure; RUN curl -L -o ~/mambaforge.sh https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-x86_64.sh \ && chmod +x ~/mambaforge.sh \ && ~/mambaforge.sh -b -p /opt/conda \ && rm ~/mambaforge.sh \ && /opt/conda/bin/conda update conda \ && /opt/conda/bin/conda install -c conda-forge -y \ python=$PYTHON_VERSION \ cython \ mkl-include \ mkl \ parso \ scipy \ typing \ # Below 2 are included in miniconda base, but not mamba so need to install conda-content-trust \ charset-normalizer \ && /opt/conda/bin/conda clean -ya RUN conda install -c conda-forge \ opencv \ scikit-learn \ pandas \ h5py \ requests \ && conda clean -ya \ && pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \ && ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \ && pip install packaging==20.4 \ enum-compat==0.0.3 \ numpy==1.20.3 \ ipython \ # pyOpenSSL requires cryptography>=2.3, but all versions <3.3 have vulnerabilities "cryptography>=3.3.2" RUN pip install --no-cache-dir -U \ scipy \ six \ # install PyYAML>=5.4 to avoid conflict with latest awscli "pyYAML>=5.4,<5.5" \ "pillow>=8.3" \ "awscli<2" \ boto3 RUN pip install neuron-cc[tensorflow] --extra-index-url https://pip.repos.neuron.amazonaws.com \ && pip install "torch-neuron>=1.10.2,<1.10.3" --extra-index-url https://pip.repos.neuron.amazonaws.com \ && pip install torchserve==$TS_VERSION \ && pip install --no-deps --no-cache-dir -U torchvision==0.11.3 \ # Install TF 1.15.5 to override neuron-cc[tensorflow]'s installation of tensorflow==1.15.0 && pip install -U tensorflow==1.15.5 \ && pip install torch-model-archiver==$TS_VERSION RUN useradd -m model-server \ && mkdir -p /home/model-server/tmp /opt/ml/model \ && chown -R model-server 
/home/model-server /opt/ml/model COPY torchserve-neuron.sh /usr/local/bin/entrypoint.sh COPY config.properties /home/model-server RUN chmod +x /usr/local/bin/dockerd-entrypoint.py \ && chmod +x /usr/local/bin/neuron-monitor.sh \ && chmod +x /usr/local/bin/entrypoint.sh ADD https://raw.githubusercontent.com/aws/deep-learning-containers/master/src/deep_learning_container.py /usr/local/bin/deep_learning_container.py RUN chmod +x /usr/local/bin/deep_learning_container.py RUN pip install --no-cache-dir "sagemaker-pytorch-inference==2.0.8" RUN HOME_DIR=/root \ && curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \ && unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \ && cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance \ && chmod +x /usr/local/bin/testOSSCompliance \ && chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh \ && ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} \ && rm -rf ${HOME_DIR}/oss_compliance* RUN curl https://aws-dlc-licenses.s3.amazonaws.com/pytorch-1.10/license.txt -o /license.txt EXPOSE 8080 8081 CMD ["/usr/local/bin/entrypoint.sh"] ================================================ FILE: containers/docker-example/inference/Dockerfile-inference-dlc.rst ================================================ .. _inference-dlc-dockerfile: DLC sample Dockerfile for Application Container ============================================== .. literalinclude:: Dockerfile-inference-dlc :linenos: ================================================ FILE: containers/docker-example/inference/Dockerfile-libmode ================================================ # Example pytorch neuron container # To build: # docker build . -f Dockerfile.pt -t neuron-container:pytorch # To run on EC2 Inf1 instances with AWS DLAMI: # docker run -it --device=/dev/neuron0 neuron-container:pytorch FROM ubuntu:24.04 LABEL maintainer=" " RUN apt-get update -y \ && apt-get install -y --no-install-recommends \ gnupg2 \ wget \ python3-pip \ python3-setuptools \ && cd /usr/local/bin \ && pip3 --no-cache-dir install --upgrade pip \ && rm -rf /var/lib/apt/lists/* \ && apt-get clean RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - # Installing Neuron Tools RUN apt-get update -y && apt-get install -y \ aws-neuron-tools # Sets up Path for Neuron tools ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}" # Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference) RUN pip3 install \ torch-neuron \ --extra-index-url=https://pip.repos.neuron.amazonaws.com # Include your APP dependencies here. # RUN ... # Define the entrypoint script that has some application code (if needed) and executes the docker run command # For example you can use something like below # COPY dockerd-libmode-entrypoint.sh /opt/bin/dockerd-entrypoint.sh # RUN chmod +x /opt/bin/dockerd-entrypoint.sh # ENTRYPOINT ["/opt/bin/dockerd-entrypoint.sh"] CMD ["neuron-top"] ================================================ FILE: containers/docker-example/inference/Dockerfile-libmode.rst ================================================ .. _libmode-dockerfile: Dockerfile for Application Container ==================================== .. 
literalinclude:: Dockerfile-inference :linenos: ================================================ FILE: containers/docker-example/inference/Dockerfile-tf-serving.rst ================================================ .. _tensorflow-model-server-neuron-dockerfile: tensorflow-model-server-neuron Dockerfile ========================================= .. literalinclude:: Dockerfile.tf-serving :linenos: ================================================ FILE: containers/docker-example/inference/Dockerfile.mxnet-serving ================================================ # To build: # docker build . -f Dockerfile.mxnet-serving -t mxnet-model-server-neuron FROM amazonlinux:2 ENV PYTHONUNBUFFERED TRUE RUN dnf install -y gcc-c++ RUN dnf install -y python3-devel RUN dnf install -y java-1.8.0-openjdk RUN dnf install -y curl RUN cd /tmp \ && curl -O https://bootstrap.pypa.io/get-pip.py \ && python3 get-pip.py RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1 RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1 RUN pip install mxnet-neuron --index-url=https://pip.repos.neuron.amazonaws.com RUN pip install multi-model-server RUN useradd -m model-server \ && mkdir -p /home/model-server/tmp COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh RUN mkdir -p /home/model-server/tmp/models/ #copy your model COPY mxnet_model/resnet-50_compiled.mar /home/model-server/tmp/models/ RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh \ && chown -R model-server /home/model-server EXPOSE 8080 8081 USER model-server WORKDIR /home/model-server ENV TEMP=/home/model-server/tmp ENTRYPOINT ["/usr/local/bin/dockerd-entrypoint.sh"] CMD ["serve"] ================================================ FILE: containers/docker-example/inference/Dockerfile.tf-serving ================================================ # Example tensorflow-model-server-neuron dockerfile. # Note: tensorflow_model_server_neuron must be pointed at the model location and name using MODEL_BASE_PATH and # MODEL_NAME env variables. MODEL_BASE_PATH may be an s3 location. # To build: # docker build . -f Dockerfile.tf-serving -t tensorflow-model-server-neuron FROM amazonlinux:2 # Expose ports for gRPC and REST EXPOSE 8500 8501 ENV MODEL_BASE_PATH=/models \ MODEL_NAME=model RUN echo $'[neuron] \n\ name=Neuron YUM Repository \n\ baseurl=https://yum.repos.neuron.amazonaws.com \n\ enabled=1' > /etc/yum.repos.d/neuron.repo RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB RUN dnf install -y tensorflow-model-server-neuron RUN mkdir -p /root/models/ #copy your model COPY tf_model/ /root/models/ RUN ls -la /root/models/* CMD ["/bin/sh", "-c", "/usr/local/bin/tensorflow_model_server_neuron --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=/root/models/${MODEL_NAME}"] ================================================ FILE: containers/docker-example/inference/config-properties.rst ================================================ .. _torchserve-config-properties: Torchserve config.properties example ==================================== .. 
literalinclude:: config.properties
   :linenos:

================================================
FILE: containers/docker-example/inference/config.properties
================================================
vmargs=-XX:+UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
model_store=/opt/ml/model
load_models=ALL
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
# management_address=unix:/tmp/management.sock
# number_of_netty_threads=0
# netty_client_threads=0
# default_response_timeout=120
# default_workers_per_model=0
# job_queue_size=100
# async_logging=false
# number_of_gpu=1
# cors_allowed_origin
# cors_allowed_methods
# cors_allowed_headers
# keystore=src/test/resources/keystore.p12
# keystore_pass=changeit
# keystore_type=PKCS12
# private_key_file=src/test/resources/key.pem
# certificate_file=src/test/resources/certs.pem
# max_response_size=6553500
# max_request_size=6553500
# blacklist_env_vars=
# decode_input_request=false
# enable_envvars_config=false

================================================
FILE: containers/docker-example/inference/dockerd-libmode-entrypoint.rst
================================================
.. _dockerd-libmode-entrypoint:

Docker Entrypoint Example - Application container
=================================================

.. literalinclude:: dockerd-libmode-entrypoint.sh
   :linenos:

================================================
FILE: containers/docker-example/inference/dockerd-libmode-entrypoint.sh
================================================
#!/bin/bash

if [[ "$1" = "serve" ]]; then
    # Start your application here!
    # e.g: 'python my_server_app.py'
    :  # no-op placeholder so the branch is valid bash until an application command is added
else
    eval "$@"
fi

# prevent docker exit
tail -f /dev/null

================================================
FILE: containers/docker-example/inference/torchserve-neuron.rst
================================================
.. _torchserve-neuron:

Torchserve Example
==================

.. literalinclude:: torchserve-neuron.sh
   :linenos:

================================================
FILE: containers/docker-example/inference/torchserve-neuron.sh
================================================
#!/bin/bash

MODEL_STORE=/opt/ml/model
TS_CONFIG=/home/model-server/config.properties
MODEL_PATH=""

while getopts ":m:t:" opt; do
    case $opt in
        m) MODEL_PATH="$OPTARG" ;;
        t) TS_CONFIG="$OPTARG" ;;
        \?) echo "Invalid option -$OPTARG" >&2 ;;
    esac
done

printf "Model path: %s\n" "$MODEL_PATH"
printf "TS_CONFIG: %s\n" "$TS_CONFIG"

# Start the Model Server
if [[ -z "$MODEL_PATH" ]]; then
    torchserve --start --ts-config /home/model-server/config.properties --model-store /opt/ml/model &
else
    torchserve --start --ts-config $TS_CONFIG --models $MODEL_PATH &
fi
status=$?
if [ $status -ne 0 ]; then
    echo "Failed to start TorchServe: $status"
    exit $status
fi

================================================
FILE: containers/docker-example/training/Dockerfile-training-dlc
================================================
# Example pytorch neuron container
# To build:
# docker build .
-f Dockerfile.pt -t neuron-container:pytorch # To run on EC2 Inf1 instances with AWS DLAMI: # docker run -it --net=host --device=/dev/neuron0 neuron-container:pytorch # You can find the latest Pytorch Training Image here - https://gallery.ecr.aws/neuron/pytorch-training-neuronx FROM public.ecr.aws/neuron/pytorch-training-neuronx:2.9.0-neuronx-py310-sdk2.27.0-ubuntu24.04 RUN mkdir -p /opt/ml COPY model.py /opt/ml/model.py COPY mlp_train.py /opt/ml/mlp_train.py ================================================ FILE: containers/docker-example/training/Dockerfile-trainium-dlc.rst ================================================ .. _trainium-dlc-dockerfile: Dockerfile for Application Container ==================================== .. literalinclude:: Dockerfile-training-dlc :linenos: ================================================ FILE: containers/docker-example/training/mlp.rst ================================================ .. _mlp-train: Simple MLP train script ======================== Save the following contents as mlp_train.py .. literalinclude:: mlp_train.py :linenos: Save the following contents as model.py .. literalinclude:: model.py :linenos: ================================================ FILE: containers/docker-example/training/mlp_train.py ================================================ import os import time import torch from model import MLP from torchvision.datasets import mnist from torch.utils.data import DataLoader from torchvision.transforms import ToTensor # XLA imports import torch_xla.core.xla_model as xm # Global constants EPOCHS = 4 WARMUP_STEPS = 2 BATCH_SIZE = 32 # Load MNIST train dataset train_dataset = mnist.MNIST(root='./MNIST_DATA_train', train=True, download=True, transform=ToTensor()) def main(): # Prepare data loader train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE) # Fix the random number generator seeds for reproducibility torch.manual_seed(0) # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance) device = 'xla' # Move model to device and declare optimizer and loss function model = MLP().to(device) optimizer = torch.optim.SGD(model.parameters(), lr=0.01) loss_fn = torch.nn.NLLLoss() # Run the training loop print('----------Training ---------------') model.train() for epoch in range(EPOCHS): start = time.time() for idx, (train_x, train_label) in enumerate(train_loader): optimizer.zero_grad() train_x = train_x.view(train_x.size(0), -1) train_x = train_x.to(device) train_label = train_label.to(device) output = model(train_x) loss = loss_fn(output, train_label) loss.backward() optimizer.step() xm.mark_step() # XLA: collect ops and run them in XLA runtime if idx < WARMUP_STEPS: # skip warmup iterations start = time.time() # Compute statistics for the last epoch interval = idx - WARMUP_STEPS # skip warmup iterations throughput = interval / (time.time() - start) print("Train throughput (iter/sec): {}".format(throughput)) print("Final loss is {:0.4f}".format(loss.detach().to('cpu'))) # Save checkpoint for evaluation os.makedirs("checkpoints", exist_ok=True) checkpoint = {'state_dict': model.state_dict()} # XLA: use xm.save instead of torch.save to ensure states are moved back to cpu # This can prevent "XRT memory handle not found" at end of test.py execution xm.save(checkpoint,'checkpoints/checkpoint.pt') print('----------End Training ---------------') if __name__ == '__main__': main() ================================================ FILE: containers/docker-example/training/model.py ================================================ import 
torch.nn as nn import torch.nn.functional as F # Declare 3-layer MLP for MNIST dataset class MLP(nn.Module): def __init__(self, input_size = 28 * 28, output_size = 10, layers = [120, 84]): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) def forward(self, x): x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return F.log_softmax(x, dim=1) ================================================ FILE: containers/docker-example/v1/inference/Dockerfile-app-rt-diff.rst ================================================ .. _app-rt-diff-dockerfile: Dockerfile with Application and Runtime in different Container ============================================================== .. literalinclude:: Dockerfile.app-rt-diff :linenos: ================================================ FILE: containers/docker-example/v1/inference/Dockerfile-app-rt-same.rst ================================================ .. _app-rt-same-dockerfile: Dockerfile with Application and Runtime in same Container ========================================================= .. literalinclude:: Dockerfile.torch-neuron :linenos: ================================================ FILE: containers/docker-example/v1/inference/Dockerfile-neuron-rtd.rst ================================================ .. _neuron-runtime-dockerfile: Neuron Runtime Dockerfile ========================= .. literalinclude:: Dockerfile.neuron-rtd :linenos: ================================================ FILE: containers/docker-example/v1/inference/Dockerfile-torch-neuron.rst ================================================ .. _torch-neuron-dockerfile: torch-neuron Dockerfile ======================= .. literalinclude:: Dockerfile.torch-neuron :linenos: ================================================ FILE: containers/docker-example/v1/inference/Dockerfile.app-rt-diff ================================================ # Example pytorch neuron container # To build: # docker build . -f Dockerfile.pt -t neuron-container:pytorch # To run on EC2 Inf1 instances with AWS DLAMI: # sudo service neuron-rtd stop # docker run -it --device=/dev/neuron0 -v /run/:/run --cap-add IPC_LOCK neuron-container:pytorch FROM ubuntu:18.04 LABEL maintainer=" " RUN apt-get update -y \ && apt-get install -y --no-install-recommends \ wget \ gnupg2 \ python3-pip \ python3-setuptools \ && cd /usr/local/bin \ && pip3 --no-cache-dir install --upgrade pip \ && rm -rf /var/lib/apt/lists/* \ && apt-get clean RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - # Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference) RUN pip3 install \ torch-neuron \ --extra-index-url=https://pip.repos.neuron.amazonaws.com # Include your APP dependencies here. # RUN/ENTRYPOINT/CMD ... ================================================ FILE: containers/docker-example/v1/inference/Dockerfile.neuron-rtd ================================================ # Example neuron-rtd dockerfile. # To build: # docker build . 
-f Dockerfile.neuron-rtd -t neuron-rtd
# Note: the container must start with the CAP_IPC_LOCK capability
# To run on EC2 Inf1 instances with AWS DLAMI:
#    sudo service neuron-rtd stop
#    docker run --env AWS_NEURON_VISIBLE_DEVICES="0" --cap-add IPC_LOCK -v /tmp/neuron_rtd_sock/:/sock neuron-rtd
FROM amazonlinux:2

RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install -y aws-neuron-tools
RUN yum install -y aws-neuron-runtime
RUN yum install -y tar gzip

ENV PATH="/opt/aws/neuron/bin:${PATH}"

CMD neuron-rtd -g unix:/sock/neuron.sock --log-console

================================================
FILE: containers/docker-example/v1/inference/Dockerfile.torch-neuron
================================================
# Example pytorch neuron container
# Note: a dockerd_entrypoint.sh script is required to successfully build this image. Place the script in the same folder as the Dockerfile.
# To build:
#    docker build . -f Dockerfile.pt -t neuron-container:pytorch
# To run on EC2 Inf1 instances with AWS DLAMI:
#    sudo service neuron-rtd stop
#    docker run -it --device=/dev/neuron0 --cap-add IPC_LOCK neuron-container:pytorch
FROM ubuntu:18.04

LABEL maintainer=" "

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
    gnupg2 \
    wget \
    python3-pip \
    python3-setuptools \
    libcap-dev \
    && cd /usr/local/bin \
    && pip3 --no-cache-dir install --upgrade pip \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Installing Neuron Runtime and Tools
RUN apt-get update -y && apt-get install -y \
    aws-neuron-runtime \
    aws-neuron-tools

# Sets up Path for Neuron tools
ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"

# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
    --extra-index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...

# Define the entrypoint script that starts the runtime and executes the docker run command
COPY dockerd-entrypoint.sh /opt/bin/dockerd-entrypoint.sh
RUN chmod +x /opt/bin/dockerd-entrypoint.sh
ENTRYPOINT ["/opt/bin/dockerd-entrypoint.sh"]
CMD ["neuron-top"]

================================================
FILE: containers/docker-example/v1/inference/dockerd-entrypoint-app-rt-same.rst
================================================
.. _dockerd-entrypoint-app-rt-same:

Docker Entrypoint Example - Application and Runtime in same Container
=====================================================================

.. literalinclude:: dockerd-entrypoint.sh
   :linenos:

================================================
FILE: containers/docker-example/v1/inference/dockerd-entrypoint.sh
================================================
#!/bin/bash
set -e

wait_for_nrtd() {
    nrtd_sock="/run/neuron.sock"
    SOCKET_TIMEOUT=300
    is_wait=true
    wait_time=0
    i=1
    sp="/-\|"
    echo -n "Waiting for neuron-rtd "
    pid=$1
    while $is_wait; do
        if [ -S "$nrtd_sock" ]; then
            echo "$nrtd_sock Exist..."
            is_wait=false
        else
            sleep 1
            wait_time=$((wait_time + 1))
            if [ "$wait_time" -gt "$SOCKET_TIMEOUT" ]; then
                echo "neuron-rtd failed to start, exiting"
                cat /tmp/nrtd.log
                exit 1
            fi
            printf "\b${sp:i++%${#sp}:1}"
        fi
    done
    cat /tmp/nrtd.log
}

# Start neuron-rtd
/opt/aws/neuron/bin/neuron-rtd -g unix:/run/neuron.sock --log-console >> /tmp/nrtd.log 2>&1 &
nrtd_pid=$!
echo "NRTD PID: "$nrtd_pid""

# wait for nrtd to be up (5 minutes timeout)
wait_for_nrtd $nrtd_pid
export NEURON_RTD_ADDRESS=unix:/run/neuron.sock
nrtd_present=1

if [[ "$1" = "serve" ]]; then
    # Start your application here!
    # e.g: 'python my_server_app.py'
    :  # no-op placeholder so the branch is valid bash until an application command is added
else
    eval "$@"
fi

# prevent docker exit
tail -f /dev/null
================================================
FILE: containers/ec2-then-ec2-devflow.rst
================================================
.. _containers-ec2-then-ec2-devflow:

.. include:: /devflows/inference/ec2-then-ec2-devflow.rst

================================================
FILE: containers/ec2.rst
================================================
.. _ec2-instance:

EC2 Instance
============

Introduction
------------

Using Neuron in containers on EC2 is straightforward; follow these steps:

- :ref:`tutorial-docker-env-setup-for-neuron`
- More details on EC2 setup `can be found at `_

DLC Images
----------

- The location for DLC images for Neuron can be obtained from `here `_
- To get the list of images for Neuron, the following commands can be used.

  ``aws ecr list-images --registry-id 763104351884 --repository-name tensorflow-inference-neuron``

  ``aws ecr list-images --registry-id 763104351884 --repository-name pytorch-inference-neuron``

Setup recommendations
---------------------

- The EC2 Inf1 instance needs to have the aws-neuron-runtime-base and aws-neuron-dkms packages installed.
- The DLC inference container runs the framework server (like tensorflow-model-server or TorchServe) and also the Neuron runtime, which interacts with the Neuron driver running on the host.
- For more details on setting up the container, check the `tensorflow `_ or `pytorch `_. Make sure the appropriate framework container image is used.

Debug Hints
-----------

- Use the docker logs command to get the neuron-rtd logs from the container.

  ``docker logs ``

- Look for errors like the following - if you see *nrtd[8]: [TDRV:tdrv_init_mla_phase1] Could not open the device index:0*, it means either that some other container is using that device or that the host is running the neuron-rtd process.
- Check that the host is not running neuron-rtd:

  ``sudo systemctl status neuron-rtd``

================================================
FILE: containers/faq-troubleshooting-releasenote.rst
================================================
Containers - FAQ, Troubleshooting & Release Notes
=================================================

.. toctree::
   :maxdepth: 1
   :hidden:

   FAQ
   troubleshooting
   /release-notes/components/containers

* :ref:`container-faq`
* :ref:`container-troubleshooting`
* :ref:`containers_rn`

================================================
FILE: containers/faq.rst
================================================
.. _container-faq:

Neuron Containers FAQ
=====================

.. contents:: Table of Contents
   :local:
   :depth: 1

Where can I find DLC images
---------------------------

* The Inference/Training DLC images can be found `here `_.
* In the `DLC release page `_, search for "neuron" to get the ECR repository location of a specific Neuron DLC release.
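As a quick illustration, here is one way to log in to the DLC registry and pull an image. This is a sketch: the account ID and repository name come from the ``aws ecr list-images`` commands in the EC2 section above, while the region and ``<tag>`` are placeholders you should take from the DLC release page.

.. code-block:: bash

   # Authenticate Docker against the DLC registry (us-east-1 assumed)
   aws ecr get-login-password --region us-east-1 \
       | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

   # List available tags, then pull the one you need
   aws ecr list-images --registry-id 763104351884 --repository-name pytorch-inference-neuron
   docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuron:<tag>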
What is the OCI Neuron hook and do we need it
---------------------------------------------

Neuron devices are exposed to containers using the ``--device`` option of the docker run command. The Docker runtime (runc) does not yet support an ALL option to expose all Neuron devices to the container. The OCI Neuron hook adds support for exposing ALL devices to a container through the environment variable ``AWS_NEURON_VISIBLE_DEVICES=ALL``. For more details, please refer to the :ref:`oci neuron hook `

In Kubernetes, the OCI Neuron hook is needed when using device plugin version 1.7 or below. With device plugin version 1.8 or later, the OCI Neuron hook is not needed.

What container runtimes are supported
-------------------------------------

Neuron containers have been tested to work with the docker, containerd, and cri-o runtimes without any changes. If the OCI Neuron hook is used, it needs to be enabled in the runtime config. For more details, please refer to the :ref:`oci neuron hook `

How to expose Neuron Devices to Container
-----------------------------------------

A Neuron device represents one Inferentia/Trainium chip in the instance. Refer to :ref:`Container Devices ` for more details.

How to expose Neuron Cores to Container
---------------------------------------

A Neuron core is one NeuronCore in the instance; each Inferentia1 device has 4 Neuron Cores, and each Inferentia2 and Trainium1 device has 2 Neuron Cores. Refer to :ref:`Container Cores ` for more details. When devices are exposed to a container, all the cores in those devices are available for use in the container. Please refer to :ref:`nrt-configuration` to see how the environment variables NEURON_RT_VISIBLE_CORES and NEURON_RT_NUM_CORES can be used to assign cores to containers.

Can Neuron Devices be shared by different Containers running in the same Host
------------------------------------------------------------------------------

Yes, except in a Kubernetes environment, where the devices cannot be shared.

Can Neuron Cores be shared by different Containers running in the same Host
---------------------------------------------------------------------------

No

When would you use Neuron K8 Scheduler Extension
-------------------------------------------------

The Neuron cores/devices that are exposed to a container need to be contiguous. The Kubernetes device plugin does not guarantee that assigned devices are contiguous. The K8 Neuron Scheduler Extension takes care of assigning contiguous devices to the containers.

How to add EFA devices to the container
---------------------------------------

The EFA devices are exposed to the container using the --device option

::

   --device /dev/infiniband/uverbs0

In a Kubernetes environment, the EFA device plugin is used to detect and advertise the available EFA interfaces. The EFA device plugin can be installed using the `Helm chart provided by Amazon EKS `_

::

   helm repo add eks https://aws.github.io/eks-charts
   helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin

Once the plugin is deployed, applications can use the resource type vpc.amazonaws.com/efa in a pod request spec

::

   resources:
      limits:
         vpc.amazonaws.com/efa: 4

Can distributed training jobs be run without EFA devices in the container
--------------------------------------------------------------------------

No.
For distributed training jobs on Trainium, all EFA interfaces provided by trn1.32xlarge need to be attached to the container ================================================ FILE: containers/files/index-dra.rst ================================================ .. meta:: :description: Templates supporting AWS Neuron Dynamic Resource Allocation (DRA) on Kubernetes. :keywords: AWS Neuron, Neuron DRA, Dynamic Resource Allocation, Kubernetes, K8s, Device Plugin :date-modified: 02/05/2026 AWS Neuron Dynamic Resource Allocation (DRA) on Kubernetes: Support files ========================================================================= This page provides templates supporting AWS Neuron Dynamic Resource Allocation (DRA) on Kubernetes. You can view and download these files from the links below. Resource Claim Specifications ----------------------------- Example resource claim templates and pod specifications demonstrating different Neuron device allocation patterns for various workload requirements. .. list-table:: :header-rows: 1 :widths: 30 55 15 * - File Name - Description - Download * - 1x4-connected-devices.yaml - Resource claim template for allocating 4 connected Neuron devices with topology constraints for optimal performance. - :download:`Download ` * - 2-node-inference-us.yaml - Multi-node inference configuration for distributed workloads across 2 Trainium nodes. - :download:`Download ` * - 4-node-inference-us.yaml - Large-scale inference setup for distributed workloads spanning 4 Trainium nodes. - :download:`Download ` * - all-devices.yaml - Resource claim template that allocates all available Neuron devices on a trn2.48xlarge instance. - :download:`Download ` * - lnc-setting-trn2.yaml - Logical NeuronCore configuration template optimized for Trainium2 instances. - :download:`Download ` * - specific-driver-version.yaml - Example configuration for requesting specific Neuron driver versions in resource claims. - :download:`Download ` * - us-and-lnc-config.yaml - Example configuration for requesting UltraServer node with Logical NeuronCore configuration. 
- :download:`Download ` ================================================ FILE: containers/files/manifests/clusterrole.yaml ================================================ apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: neuron-dra-driver-clusterrole rules: # Required for DRA device plugin to manage ResourceSlices - apiGroups: ["resource.k8s.io"] resources: ["resourceslices"] verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # Required for DRA device plugin to read ResourceClaims - apiGroups: ["resource.k8s.io"] resources: ["resourceclaims"] verbs: ["get", "list", "watch"] # Required for DRA device plugin to read DeviceClasses - apiGroups: ["resource.k8s.io"] resources: ["deviceclasses"] verbs: ["get", "list", "watch"] # Required to read and modify node information - apiGroups: [""] resources: ["nodes"] verbs: ["get", "list", "watch", "patch", "update"] # Required to modify node status - apiGroups: [""] resources: ["nodes/status"] verbs: ["patch"] ================================================ FILE: containers/files/manifests/clusterrolebinding.yaml ================================================ apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: neuron-dra-driver-binding roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: neuron-dra-driver-clusterrole subjects: - kind: ServiceAccount name: neuron-dra-driver-sa namespace: neuron-dra-driver ================================================ FILE: containers/files/manifests/daemonset.yaml ================================================ apiVersion: apps/v1 kind: DaemonSet metadata: name: neuron-dra-driver-kubelet-plugin namespace: neuron-dra-driver labels: app: neuron-dra-driver-kubelet-plugin spec: updateStrategy: type: RollingUpdate rollingUpdate: maxUnavailable: 0 maxSurge: 1 selector: matchLabels: app: neuron-dra-driver-kubelet-plugin template: metadata: labels: app: neuron-dra-driver-kubelet-plugin spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: node.kubernetes.io/instance-type operator: In values: - trn1.2xlarge - trn1.32xlarge - trn1n.32xlarge - trn2.3xlarge - trn2.48xlarge - trn2n.48xlarge - key: eks.amazonaws.com/compute-type operator: NotIn values: - fargate - hybrid - auto serviceAccountName: neuron-dra-driver-sa hostNetwork: true containers: - name: neuron-dra-driver image: NEURON_DRA_IMAGE imagePullPolicy: Always command: ["k8s-neuron-dra-driver"] # args: # - --v=6 env: - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName - name: POD_UID valueFrom: fieldRef: fieldPath: metadata.uid - name: CDI_ROOT value: "/var/run/cdi" - name: KUBELET_REGISTRAR_DIRECTORY_PATH value: "/var/lib/kubelet/plugins_registry" - name: KUBELET_PLUGINS_DIRECTORY_PATH value: "/var/lib/kubelet/plugins" - name: HEALTHCHECK_PORT value: "51515" - name: NEURON_DRA_DRIVER_EMULATION_MODE value: "trn2u" resources: limits: cpu: 20m memory: 256Mi requests: cpu: 10m memory: 128Mi securityContext: privileged: true volumeMounts: - name: kubelet-plugins-dir mountPath: /var/lib/kubelet/plugins - name: kubelet-registry-dir mountPath: /var/lib/kubelet/plugins_registry - name: cdi-dir mountPath: /var/run/cdi livenessProbe: grpc: port: 51515 service: liveness failureThreshold: 3 periodSeconds: 10 initialDelaySeconds: 30 timeoutSeconds: 5 volumes: - name: kubelet-plugins-dir hostPath: path: /var/lib/kubelet/plugins - name: kubelet-registry-dir hostPath: path: /var/lib/kubelet/plugins_registry - 
name: cdi-dir hostPath: path: /var/run/cdi tolerations: - key: CriticalAddonsOnly operator: Exists - key: aws.amazon.com/neuron operator: Exists effect: NoSchedule - key: sagemaker.amazonaws.com/node-health-status operator: Equal value: Unschedulable effect: NoSchedule # - key: "kwok.x-k8s.io/node" # operator: "Exists" # effect: "NoSchedule" ================================================ FILE: containers/files/manifests/deviceclass.yaml ================================================ apiVersion: resource.k8s.io/v1beta1 kind: DeviceClass metadata: name: neuron.aws.com spec: selectors: - cel: expression: device.driver == "neuron.aws.com" ================================================ FILE: containers/files/manifests/namespace.yaml ================================================ apiVersion: v1 kind: Namespace metadata: name: neuron-dra-driver labels: name: neuron-dra-driver ================================================ FILE: containers/files/manifests/serviceaccount.yaml ================================================ apiVersion: v1 kind: ServiceAccount metadata: name: neuron-dra-driver-sa namespace: neuron-dra-driver ================================================ FILE: containers/files/scripts/install-dra-driver.sh ================================================ #!/bin/bash # Deploy Neuron DRA Driver set -e echo "🚀 Deploying Neuron DRA Driver..." # Check argument if [ $# -ne 1 ]; then echo "Usage: $0 " echo "Example: $0 123456789.dkr.ecr.us-west-2.amazonaws.com/neuron-dra-driver:v1.0" exit 1 fi # Get the script directory and set the manifests path SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" MANIFESTS_DIR="$SCRIPT_DIR/../../manifests" DRA_IMAGE="$1" # Apply all manifests in order echo "📝 Creating namespace..." kubectl apply -f "$MANIFESTS_DIR/namespace.yaml" echo "🔐 Creating ServiceAccount and RBAC..." kubectl apply -f "$MANIFESTS_DIR/serviceaccount.yaml" kubectl apply -f "$MANIFESTS_DIR/clusterrole.yaml" kubectl apply -f "$MANIFESTS_DIR/clusterrolebinding.yaml" echo "📱 Creating DeviceClass..." kubectl apply -f "$MANIFESTS_DIR/deviceclass.yaml" echo "🔧 Deploying DRA DaemonSet..." # Check if DaemonSet already exists before applying DAEMONSET_EXISTS=false if kubectl get daemonset neuron-dra-driver-kubelet-plugin -n neuron-dra-driver >/dev/null 2>&1; then DAEMONSET_EXISTS=true echo "📋 DaemonSet already exists, will restart after applying..." fi echo "🏷️ Using custom image: $DRA_IMAGE" sed "s|NEURON_DRA_IMAGE|$DRA_IMAGE|g" "$MANIFESTS_DIR/daemonset.yaml" | kubectl apply -f - # If DaemonSet was already running, restart it to pull latest image if [ "$DAEMONSET_EXISTS" = true ]; then echo "🔄 Restarting DaemonSet to pull latest image..." kubectl rollout restart daemonset/neuron-dra-driver-kubelet-plugin -n neuron-dra-driver echo "⏳ Waiting for rollout to complete..." kubectl rollout status daemonset/neuron-dra-driver-kubelet-plugin -n neuron-dra-driver --timeout=300s else echo "⏳ Waiting until pods are in a running state..." kubectl wait --for=condition=ready pod -l app=neuron-dra-driver-kubelet-plugin -n neuron-dra-driver --timeout=300s fi echo "✅ Deployment complete!" 
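# Optional post-install sanity checks (a sketch; these assume a cluster with the
# DRA feature enabled so the resource.k8s.io API groups are served):
#   kubectl get pods -n neuron-dra-driver          # plugin pods should be Running
#   kubectl get deviceclasses neuron.aws.com       # DeviceClass should be present
#   kubectl get resourceslices                     # per-node Neuron device inventory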
echo "" echo "📊 Recent logs from dra driver:" kubectl logs -n neuron-dra-driver -l app=neuron-dra-driver-kubelet-plugin --tail=10 echo "" ================================================ FILE: containers/files/specs/1x4-connected-devices.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: 1x4-connected-neurons spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com allocationMode: ExactCount count: 4 selectors: - cel: expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'" constraints: - requests: ["neurons"] matchAttribute: "resource.aws.com/devicegroup4_id" --- apiVersion: v1 kind: Pod metadata: name: pod0 labels: app: pod spec: containers: - name: ctr0 image: public.ecr.aws/ubuntu/ubuntu:22.04 command: ["bash", "-c"] args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"] resources: claims: - name: neurons resourceClaims: - name: neurons resourceClaimTemplateName: 1x4-connected-neurons ================================================ FILE: containers/files/specs/2-node-inference-us.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: us-2-node-config spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].resourceType == 'neuron_node'" allocationMode: ExactCount count: 1 config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: UltraServerConfig ultraserverMode: 2 --- apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: name: vllm annotations: leaderworkerset.sigs.k8s.io/exclusive-topology: neuron.amazonaws.com/ultraserver-server-id-2 spec: rolloutStrategy: type: RollingUpdate rollingUpdateConfiguration: maxUnavailable: 1 maxSurge: 1 # Two replica groups of 2 nodes each replicas: 2 leaderWorkerTemplate: size: 2 restartPolicy: RecreateGroupOnPodRestart leaderTemplate: metadata: labels: role: leader spec: containers: - name: vllm-leader image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-2-node-config workerTemplate: metadata: labels: role: worker spec: containers: - name: vllm-worker image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-2-node-config ================================================ FILE: containers/files/specs/4-node-inference-us.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: us-4-node-config spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].resourceType == 'neuron_node'" allocationMode: ExactCount count: 1 config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: UltraServerConfig ultraserverMode: 4 --- apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: name: vllm annotations: leaderworkerset.sigs.k8s.io/exclusive-topology: neuron.amazonaws.com/ultraserver-server-id-4 spec: rolloutStrategy: type: RollingUpdate 
rollingUpdateConfiguration: maxUnavailable: 1 maxSurge: 1 # Two replica groups of 4 nodes each, i.e. two ultraservers replicas: 2 leaderWorkerTemplate: size: 4 restartPolicy: RecreateGroupOnPodRestart leaderTemplate: metadata: labels: role: leader spec: containers: - name: vllm-leader image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-4-node-config workerTemplate: metadata: labels: role: worker spec: containers: - name: vllm-worker image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-4-node-config ================================================ FILE: containers/files/specs/all-devices.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: all-neurons spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'" allocationMode: All --- apiVersion: v1 kind: Pod metadata: name: pod0 labels: app: pod spec: containers: - name: ctr0 image: public.ecr.aws/ubuntu/ubuntu:22.04 command: ["bash", "-c"] args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"] resources: claims: - name: neurons resourceClaims: - name: neurons resourceClaimTemplateName: all-neurons ================================================ FILE: containers/files/specs/lnc-setting-trn2.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: all-neurons-lnc-1 spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'" allocationMode: All config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: NeuronConfig logicalNeuronCore: 1 --- apiVersion: v1 kind: Pod metadata: name: pod0 labels: app: pod spec: containers: - name: ctr0 image: public.ecr.aws/ubuntu/ubuntu:22.04 command: ["bash", "-c"] args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"] resources: claims: - name: neurons resourceClaims: - name: neurons resourceClaimTemplateName: all-neurons-lnc-1 ================================================ FILE: containers/files/specs/specific-driver-version.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: driver-version-neuron spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].neuronDriverVersion == '2.25.4.0'" allocationMode: All --- apiVersion: apps/v1 kind: Deployment metadata: name: my-app spec: replicas: 2 selector: matchLabels: app: my-app template: metadata: labels: app: my-app spec: containers: - name: nginx image: public.ecr.aws/docker/library/nginx:alpine resourceClaims: - name: neurons resourceClaimTemplateName: driver-version-neuron ================================================ FILE: containers/files/specs/us-and-lnc-config.yaml ================================================ apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: us-and-lnc-config spec: spec: 
devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].resourceType == 'neuron_node'" allocationMode: ExactCount count: 1 config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: UltraServerConfig ultraserverMode: 2 - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: NeuronConfig logicalNeuronCore: 1 --- apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: name: vllm annotations: leaderworkerset.sigs.k8s.io/exclusive-topology: neuron.amazonaws.com/ultraserver-server-id-2 spec: rolloutStrategy: type: RollingUpdate rollingUpdateConfiguration: maxUnavailable: 1 maxSurge: 1 # Two replica groups of 2 nodes each replicas: 2 leaderWorkerTemplate: size: 2 restartPolicy: RecreateGroupOnPodRestart leaderTemplate: metadata: labels: role: leader spec: containers: - name: vllm-leader image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-and-lnc-config workerTemplate: metadata: labels: role: worker spec: containers: - name: vllm-worker image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-and-lnc-config

================================================
FILE: containers/get-started/quickstart-configure-deploy-dlc.rst
================================================
.. meta::
   :description: Learn how to deploy a vLLM server using a preconfigured Neuron Deep Learning Container on Trainium and Inferentia instances.
   :date_updated: 01/26/2026

.. _quickstart_vllm_dlc_deploy:

Quickstart: Configure and deploy a vLLM server using Neuron Deep Learning Container (DLC)
==========================================================================================

This topic guides you through deploying a vLLM server on Trainium and Inferentia instances using a Deep Learning Container preconfigured with AWS Neuron SDK artifacts. When you complete this tutorial, you will be able to run a vLLM inference server on AWS Trainium and Inferentia instances.

Overview
--------

In this quickstart, you will pull a vLLM Docker image, configure it for Neuron devices, and start an inference server running vLLM. This process lets you deploy large language models on AWS ML accelerators for high-performance inference workloads.

Before you start
----------------

This tutorial assumes that you have experience in the following areas:

* Docker container management
* AWS EC2 instance administration
* Command-line interface operations

Prerequisites
-------------

Before you begin, ensure you have:

* AWS Trainium or Inferentia instance access
* Docker installed on your instance. You can set up your Docker environment according to :ref:`tutorial-docker-env-setup`
* SSH access to your instance

Prepare your environment
------------------------

Launch an AWS Trainium or Inferentia instance with sufficient resources for your model requirements. We recommend using one of the base DLAMIs to launch your instance - `Neuron Base DLAMI <#>`.

Step 1: Pull the vLLM Docker image
-----------------------------------

In this step, you will download the vLLM Docker image from AWS ECR.
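Optionally, before pulling the image, confirm that the host exposes its Neuron devices. This quick sanity check assumes you launched from a Neuron DLAMI, so the driver and tools are already installed on the host:

.. code-block:: bash

   ls /dev/neuron*   # device files created by the Neuron driver
   neuron-ls         # device summary (requires aws-neuronx-tools on the host)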
Get the latest vLLM Docker image from Neuron's ECR public gallery `pytorch-inference-vllm-neuronx <https://gallery.ecr.aws/neuron/pytorch-inference-vllm-neuronx>`_ repository, then find the latest published image tag and use it in the command below:

.. code-block:: bash

   docker pull public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:<image_tag>

For example, replace ``<image_tag>`` with an SDK 2.28.0 released DLC image tag such as ``0.13.0-neuronx-py312-sdk2.28.0-ubuntu24.04``

Step 2: Start the Docker container
-----------------------------------

In this step, you will run the container with access to Neuron devices. For this tutorial, we are using a trn1.32xlarge instance.

Run the container interactively with access to Neuron devices:

.. code-block:: bash

   docker run -it \
     --device=/dev/neuron0 \
     --device=/dev/neuron1 \
     --device=/dev/neuron2 \
     --device=/dev/neuron3 \
     --device=/dev/neuron4 \
     --device=/dev/neuron5 \
     --device=/dev/neuron6 \
     --device=/dev/neuron7 \
     --device=/dev/neuron8 \
     --device=/dev/neuron9 \
     --device=/dev/neuron10 \
     --device=/dev/neuron11 \
     --device=/dev/neuron12 \
     --device=/dev/neuron13 \
     --device=/dev/neuron14 \
     --device=/dev/neuron15 \
     --cap-add SYS_ADMIN \
     --cap-add IPC_LOCK \
     -p 8080:8080 \
     --name <container_name> \
     <image_uri> \
     bash

.. note::

   The trn1.32xlarge instance provides 16 Neuron devices. Adjust the number of Neuron devices (``--device=/dev/neuronX``) based on your instance type and requirements.

Step 3: Start the vLLM server
------------------------------

In this step, you will launch the vLLM inference server inside the container.

Inside the container, start the vLLM inference server:

.. code-block:: bash

   vllm serve \
     --model='TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
     --max-num-seqs=4 \
     --max-model-len=128 \
     --tensor-parallel-size=2 \
     --block-size=32 \
     --num-gpu-blocks-override=16 \
     --port=8080 \
     --additional-config='{"override_neuron_config":{"enable_bucketing":false}}'

.. note::

   **Version compatibility**: The command above is compatible with vLLM version 0.11.0 and later. If you are using an older version (such as 0.9.1), you must:

   * Replace ``--additional-config='{"override_neuron_config":{"enable_bucketing":false}}'`` with ``--override-neuron-config '{"enable_bucketing":false}'``

.. important::

   * Choose the appropriate model for your use case
   * Set ``--tensor-parallel-size`` to be less than or equal to the total number of NeuronCores (or TP ranks) available from your devices, accounting for cores per device and logical core configuration
   * Server startup typically takes 5-10 minutes

Step 4: Verify server status
-----------------------------

In this step, you will confirm the server starts successfully.

Wait for the server to fully initialize. You will see output showing available API routes:

.. code-block:: text

   INFO 08-12 00:04:47 [launcher.py:28] Available routes are:
   INFO 08-12 00:04:47 [launcher.py:36] Route: /health, Methods: GET
   INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
   INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/completions, Methods: POST

.. note::

   During startup, you may see warning logs similar to the following, which can be safely ignored:

   .. code-block:: text

      No module named 'vllm._version'
        from .version import __version__, __version_tuple__  # isort:skip
      WARNING [__init__.py:25] The vLLM package was not found, so its version could not be inspected. This may cause platform detection to fail.
      INFO [__init__.py:243] Automatically detected platform neuron.
      WARNING [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")

All complete! Now, let's confirm everything works.
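As a quick first check, you can probe the ``/health`` route listed above. A minimal sketch, assuming the port mapping from Step 2 (the healthy response is an HTTP 200 status; the response body may vary by vLLM version):

.. code-block:: bash

   # Query the health endpoint from a second terminal on the instance.
   # -i prints the HTTP status line; a 200 status means the server is up.
   curl -i http://localhost:8080/health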
Step 5: Inference service confirmation --------------------------------------- Test the API to confirm your setup works correctly. Open a separate terminal and make an API call: .. code-block:: bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "What is the capital of Italy?" } ] }' You should receive a response similar to: .. code-block:: json { "id": "chatcmpl-ac7551dd2f2a4be3bd2c1aabffa79b4c", "object": "chat.completion", "created": 1754958455, "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "The capital of Italy is Rome...", "tool_calls": [] }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 23, "total_tokens": 106, "completion_tokens": 83 } } Congratulations! You have successfully deployed a vLLM inference server using a preconfigured Neuron DLC. If you encountered any issues, see the **Common issues** section below. Available API endpoints ----------------------- The server provides various endpoints for different use cases: * **Health Check**: ``GET /health`` * **Chat Completions**: ``POST /v1/chat/completions`` * **Text Completions**: ``POST /v1/completions`` * **Models Info**: ``GET /v1/models`` * **API Documentation**: ``GET /docs`` Common issues ------------- Did you encounter an error while working through this tutorial? Here are common issues and solutions: - **Server won't start**: Check that you have sufficient Neuron devices allocated - **Connection refused**: Verify the container is running and port 8080 is properly mapped - **Slow performance**: Ensure your ``tensor-parallel-size`` matches your available Neuron devices - **Memory issues**: Consider using a larger instance type or reducing model size For additional help, refer to the complete vLLM User Guide for NxD Inference documentation. Clean up -------- To clean up resources after completing this tutorial: 1. Stop the Docker container: .. code-block:: bash docker stop 2. Remove the container: .. code-block:: bash docker rm 3. Terminate your EC2 instance if no longer needed. Next steps ---------- Now that you've completed this tutorial, explore these related topics: * Learn more about vLLM configuration options in the vLLM User Guide for NxD Inference * Explore model optimization techniques for better performance * Set up production deployment with load balancing and monitoring Further reading --------------- - `vLLM User Guide for NxD Inference <#>`_ - Complete documentation for vLLM on Neuron - `AWS Neuron SDK Documentation `_ - Full Neuron SDK reference ================================================ FILE: containers/get-started/quickstart-pytorch-inference-dlc.rst ================================================ .. meta:: :description: Learn how to run PyTorch inference using preconfigured Neuron Deep Learning Container with Llama-2-7b on Trainium instances. :date_updated: 02/17/2026 .. _quickstart_pytorch_inference_dlc: Quickstart: Run PyTorch inference using Neuron Deep Learning Container (DLC) ============================================================================= This topic guides you through running PyTorch inference on Trainium instances using a Deep Learning Container preconfigured with AWS Neuron SDK artifacts. When you complete this tutorial, you will be able to run inference with the Llama-2-7b model on AWS Trainium instances. 
Overview -------- In this quickstart, you will pull a PyTorch inference Docker image, download the Llama-2-7b model from S3, and run an inference demo that compiles, validates, and benchmarks the model. This process lets you deploy large language models on AWS ML accelerators for high-performance inference workloads. Before you start ---------------- This tutorial assumes that you have experience in the following areas: * Docker container management * AWS EC2 instance administration * Command-line interface operations * AWS S3 operations Prerequisites ------------- Before you begin, ensure you have: * AWS Trainium instance access (trn2.48xlarge recommended) * Docker installed on your instance. You can set up docker environment according to :ref:`tutorial-docker-env-setup` * SSH access to your instance * AWS credentials configured with access to the model S3 bucket Prepare your environment ------------------------ Launch an AWS Trainium instance with sufficient resources for your model requirements. We recommend using one of the base DLAMIs to launch your instance - `Neuron Base DLAMI <#>`. Step 1: Pull the PyTorch inference Docker image ------------------------------------------------ In this step, you will download the PyTorch inference Docker image from AWS ECR. Get the latest PyTorch inference Docker image from Neuron's ECR public gallery `pytorch-inference-neuronx `_ repository, and then get the latest published image tag and use it in the command below: .. code-block:: bash docker pull public.ecr.aws/neuron/pytorch-inference-neuronx: For example, replace ```` with an SDK 2.28.0 released DLC image tag such as ``2.9.0-neuronx-py312-sdk2.28.0-ubuntu24.04`` Step 2: Download the Llama-2-7b model -------------------------------------- In this step, you will download the Llama-2-7b model from HuggingFace to an S3 bucket, then copy it to your instance. First, download the model from HuggingFace and upload to your S3 bucket: .. code-block:: bash # Install HuggingFace CLI if not already installed pip install huggingface-hub # Login to HuggingFace (you'll need to accept the Llama-2 license first) hf auth login # Download the model hf download meta-llama/Llama-2-7b --local-dir ./Llama-2-7b # Upload to your S3 bucket aws s3 cp --recursive ./Llama-2-7b s3://your-bucket-name/models/Llama-2-7b/ Then, on your Trainium instance, download the model from S3: .. note:: Change ``/home/ec2-user`` to ``/home/ubuntu`` if you're using an Ubuntu AMI. .. code-block:: bash # Create directory for the model mkdir -p /home/ec2-user/model_hf/Llama-2-7b # Download from S3 aws s3 cp --recursive s3://your-bucket-name/models/Llama-2-7b/ /home/ec2-user/model_hf/Llama-2-7b/ # Verify the model downloaded successfully ls /home/ec2-user/model_hf/Llama-2-7b/config.json .. note:: You must accept the Llama-2 license on HuggingFace before you can download the model. Visit https://huggingface.co/meta-llama/Llama-2-7b to request access. Step 3: Start the Docker container ----------------------------------- In this step, you will run the container with access to Neuron devices and mount the model directory. For this tutorial, we are using a trn2.48xlarge instance. Run the container interactively with access to all Neuron devices: .. 
code-block:: bash docker run -it \ --device=/dev/neuron0 \ --device=/dev/neuron1 \ --device=/dev/neuron2 \ --device=/dev/neuron3 \ --device=/dev/neuron4 \ --device=/dev/neuron5 \ --device=/dev/neuron6 \ --device=/dev/neuron7 \ --device=/dev/neuron8 \ --device=/dev/neuron9 \ --device=/dev/neuron10 \ --device=/dev/neuron11 \ -v /home/ec2-user/model_hf/Llama-2-7b:/root/model_hf/Llama-2-7b \ --cap-add SYS_ADMIN \ --cap-add IPC_LOCK \ --name pytorch-inference-demo \ public.ecr.aws/neuron/pytorch-inference-neuronx: \ bash .. note:: The trn2.48xlarge instance provides 12 Neuron devices. Adjust the number of Neuron devices (``--device=/dev/neuronX``) based on your instance type and requirements. Step 4: Run the inference demo ------------------------------- In this step, you will run the inference demo script that compiles the model, checks accuracy, and benchmarks performance. Inside the container, run the inference demo: .. code-block:: bash inference_demo \ --model-type llama \ --task-type causal-lm \ run \ --model-path /root/model_hf/Llama-2-7b/ \ --compiled-model-path /root/traced_model/Llama-2-7b-demo/ \ --torch-dtype bfloat16 \ --tp-degree 96 \ --batch-size 2 \ --max-context-length 32 \ --seq-len 64 \ --on-device-sampling \ --enable-bucketing \ --top-k 1 \ --do-sample \ --pad-token-id 2 \ --prompt 'I believe the meaning of life is' \ --prompt 'The color of the sky is' \ --check-accuracy-mode token-matching \ --benchmark .. important:: * The inference demo takes approximately 20 minutes to complete on a trn2.48xlarge instance * The script will compile the model, validate accuracy, and run benchmarks * Set ``--tp-degree`` to match the number of NeuronCores you want to use (96 for trn2.48xlarge) Step 5: Verify the results --------------------------- In this step, you will confirm the inference demo completed successfully and review the benchmark results. Wait for the demo to complete. You will see output showing benchmark results: .. code-block:: text Benchmark completed and its result is as following { "e2e_model": { "latency_ms_p50": 8539.34, "latency_ms_p90": 8627.43, "latency_ms_p95": 8646.97, "latency_ms_p99": 8652.62, "latency_ms_p100": 8654.03, "latency_ms_avg": 8533.13, "throughput": 480.01 }, "context_encoding_model": { "latency_ms_p50": 132.42, "latency_ms_p90": 133.47, "latency_ms_p95": 133.59, "latency_ms_p99": 133.81, "latency_ms_p100": 133.86, "latency_ms_avg": 132.52, "throughput": 30908.75 }, "token_generation_model": { "latency_ms_p50": 7.84, "latency_ms_p90": 8.39, "latency_ms_p95": 8.47, "latency_ms_p99": 8.63, "latency_ms_p100": 28.96, "latency_ms_avg": 7.87, "throughput": 520434.73 } } Completed saving result to benchmark_report.json .. note:: You may see several red ``ERROR NRT:nrt_tensor_free`` errors at the end of the script output. These can be safely ignored - the actual benchmark results appear above these error messages. All complete! The benchmark results are saved to ``benchmark_report.json`` in the container. 
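To review the report from the host, you can copy it out of the container. A minimal sketch, assuming the demo wrote ``benchmark_report.json`` to the container's root working directory (adjust the source path if it was written elsewhere):

.. code-block:: bash

   # Copy the benchmark report from the running container to the host;
   # the source path is an assumption -- adjust it as needed.
   docker cp pytorch-inference-demo:/benchmark_report.json .

   # Pretty-print the JSON for review.
   python3 -m json.tool benchmark_report.json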
Understanding the results ------------------------- The benchmark output provides three key metrics: * **e2e_model**: End-to-end model performance including context encoding and token generation * **context_encoding_model**: Performance of processing the input prompt * **token_generation_model**: Performance of generating output tokens Each metric includes: * Latency percentiles (p50, p90, p95, p99, p100) in milliseconds * Average latency in milliseconds * Throughput in tokens per second Common issues ------------- Did you encounter an error while working through this tutorial? Here are common issues and solutions: - **Model download fails**: Verify you have accepted the Llama-2 license on HuggingFace and have valid AWS credentials - **Container won't start**: Check that you have sufficient Neuron devices allocated - **Compilation fails**: Ensure you have enough memory and the correct PyTorch version - **Slow performance**: Verify your ``tp-degree`` matches your available Neuron devices - **Memory issues**: Consider using a larger instance type or reducing batch size For additional help, refer to the complete NeuronX Distributed Inference documentation. Clean up -------- To clean up resources after completing this tutorial: 1. Exit the container: .. code-block:: bash exit 2. Stop and remove the container: .. code-block:: bash docker stop pytorch-inference-demo docker rm pytorch-inference-demo 3. Remove the model files if no longer needed: .. code-block:: bash rm -rf /home/ec2-user/model_hf/Llama-2-7b 4. Terminate your EC2 instance if no longer needed. Next steps ---------- Now that you've completed this tutorial, explore these related topics: * Learn more about NeuronX Distributed Inference configuration options * Explore different model architectures and optimization techniques * Set up production deployment with monitoring and logging Further reading --------------- - `NeuronX Distributed Inference Documentation <#>`_ - Complete documentation for inference on Neuron - `AWS Neuron SDK Documentation `_ - Full Neuron SDK reference - `Llama-2 Model Card `_ - Model details and license information ================================================ FILE: containers/getting-started.rst ================================================ .. _containers-getting-started: Getting started with Neuron DLC using Docker ============================================ .. tab-set:: .. tab-item:: Training .. dropdown:: Launch Trn1 Instance :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. include:: /setup/install-templates/launch-instance.txt .. dropdown:: Install Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. code:: bash # Configure Linux for Neuron repository updates sudo tee /etc/yum.repos.d/neuron.repo > /dev/null < /dev/null <= 2.26.26.0 (:ref:`tutorials/k8s-neuron-device-plugin`) * MPI operator installed on the cluster * An MPI job spec Instructions ------------ UltraServer Init Script ~~~~~~~~~~~~~~~~~~~~~~~ Download the UltraServer init script :download:`k8s-ultraserver-init-script.sh ` To use the script, either: - add it to your MPI job Dockerfile and build the image OR - create a new Dockerfile and build a new image from your MPI job image Example: .. 
code-block:: dockerfile FROM 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob COPY ultraserver-init-script.sh /tmp/ RUN chmod +x /tmp/ultraserver-init-script.sh ENTRYPOINT ["/tmp/ultraserver-init-script.sh"] Then add the 2 required init containers to the launcher pod. The first init container should utilize the /etc/mpi/discover_hosts.sh script to ensure that all worker pods are ready before continuing on to the UltraServer init script. The second init container should use the image containing ultraserver-init-script.sh. You can specify a value for NEURON_ULTRASERVER_NODE_CONFIG, which determines what UltraServer node config your MPI job will use, i.e. how many UltraServer nodes to use. Possible values are 4, 2, and 1, and the default value is 4. Example: .. code-block:: yaml apiVersion: kubeflow.org/v2beta1 kind: MPIJob metadata: name: &job_name namespace: default spec: mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - name: mpitest image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob ... initContainers: - name: wait-hostfilename image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob command: - bash - -cx - | if [[ $(cat /etc/mpi/discover_hosts.sh | wc -l) != 1 ]]; then date echo "Ready" cat /etc/mpi/discover_hosts.sh else date echo "not ready ..." sleep 10 exit 1 fi while read host; do while ! ssh $host echo $host; do date echo "Pod $host is not up ..." sleep 10 done date echo "Pod $host is ready" done <<< "$(/etc/mpi/discover_hosts.sh)" resources: {} volumeMounts: - mountPath: /etc/mpi name: mpi-job-config - mountPath: /root/.ssh name: ssh-auth - name: ultraserver-init-container image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:init-container env: - name: NEURON_ULTRASERVER_NODE_CONFIG value: <"4", "2", OR "1"> volumeMounts: - mountPath: /etc/mpi name: mpi-job-config - mountPath: /root/.ssh name: ssh-auth - mountPath: /root/ultraserver_init name: ultraserver-init ... volumes: - name: ultraserver-init emptyDir: {} MPI Worker Pod Affinity ~~~~~~~~~~~~~~~~~~~~~~~ Single-node Job ^^^^^^^^^^^^^^^ 2-node job .. code-block:: yaml apiVersion: kubeflow.org/v2beta1 kind: MPIJob metadata: name: &job_name namespace: default ... spec: mpiReplicaSpecs: Launcher: ... Worker: replicas: 2 template: spec: nodeSelector: node.kubernetes.io/instance-type: trn2u.48xlarge affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: training.kubeflow.org/job-name operator: NotIn values: - *job_name matchLabels: training.kubeflow.org/job-role: worker topologyKey: neuron.amazonaws.com/ultraserver-server-id-2 podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchLabels: training.kubeflow.org/job-role: worker training.kubeflow.org/job-name: *job_name topologyKey: neuron.amazonaws.com/ultraserver-server-id-2 ... 4-node job .. code-block:: yaml apiVersion: kubeflow.org/v2beta1 kind: MPIJob metadata: name: &job_name namespace: default ... spec: mpiReplicaSpecs: Launcher: ... 
Worker: replicas: 4 template: spec: nodeSelector: node.kubernetes.io/instance-type: trn2u.48xlarge affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: training.kubeflow.org/job-name operator: NotIn values: - *job_name matchLabels: training.kubeflow.org/job-role: worker topologyKey: neuron.amazonaws.com/ultraserver-server-id-4 podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchLabels: training.kubeflow.org/job-role: worker training.kubeflow.org/job-name: *job_name topologyKey: neuron.amazonaws.com/ultraserver-server-id-4 ... Multi-node job ^^^^^^^^^^^^^^ .. code-block:: yaml apiVersion: kubeflow.org/v2beta1 kind: MPIJob metadata: name: &job_name namespace: default ... spec: mpiReplicaSpecs: Launcher: ... Worker: replicas: 16 template: spec: nodeSelector: node.kubernetes.io/instance-type: trn2u.48xlarge affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: training.kubeflow.org/job-name operator: NotIn values: - *job_name matchLabels: training.kubeflow.org/job-role: worker topologyKey: neuron.amazonaws.com/ultraserver-server-id-4 podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchLabels: training.kubeflow.org/job-role: worker training.kubeflow.org/job-name: *job_name topologyKey: neuron.amazonaws.com/ultraserver-server-id-4 ... To use the affinity configuration, replace with your MPI job name and add it to your workload yaml spec. Confirm your work ----------------- To validate that the init container is working: .. code-block:: # Find the worker pods associated with your MPI job kubectl get pods # Get the logs of the init container kubectl logs -c ultraserver-init-container You should see logs under the init container. Example: .. code-block:: $ kubectl get pods NAME READY STATUS RESTARTS AGE demo-launcher-42lh9 0/1 Init:0/2 0 4s demo-worker-0 1/1 Running 0 4s demo-worker-1 1/1 Running 0 4s demo-worker-2 1/1 Running 0 4s demo-worker-3 1/1 Running 0 4s $ kubectl logs demo-launcher-42lh9 -c ultraserver-init-container Using 4-node config ... To validate that the affinity configuration is working: .. code-block:: # Find the worker pods and the nodes they are scheduled to kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName' # Compare the labels of the nodes to the kubectl get nodes \ -l neuron.amazonaws.com/ultraserver-mode \ -o=custom-columns='NAME:metadata.name,MODE:metadata.labels.neuron\.amazonaws\.com/ultraserver-mode,ULTRASERVER_SERVER_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-2,ULTRASERVER_NODE_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-2,ULTRASERVER_SERVER_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-4,ULTRASERVER_NODE_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-4' | awk 'NR==1{print;next}{print | "sort -k3,3 -k4,4"}' When looking at the nodes used by the worker pods, they should share the same ULTRASERVER_SERVER_ID_2 or ULTRASERVER_SERVER_ID_4 label based on which config you chose. Example when choosing a 4-node config: .. 
code-block:: $ kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName' POD_NAME NODE_NAME demo-launcher-42lh9 ip-172-32-5-227.ap-southeast-4.compute.internal demo-worker-0 ip-172-32-5-227.ap-southeast-4.compute.internal demo-worker-1 ip-172-32-11-17.ap-southeast-4.compute.internal demo-worker-2 ip-172-32-13-57.ap-southeast-4.compute.internal demo-worker-3 ip-172-32-9-4.ap-southeast-4.compute.internal $ kubectl get nodes \ -l neuron.amazonaws.com/ultraserver-mode \ -o=custom-columns='NAME:metadata.name,MODE:metadata.labels.neuron\.amazonaws\.com/ultraserver-mode,ULTRASERVER_SERVER_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-2,ULTRASERVER_NODE_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-2,ULTRASERVER_SERVER_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-4,ULTRASERVER_NODE_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-4' | awk 'NR==1{print;next}{print | "sort -k3,3 -k4,4"}' NAME MODE ULTRASERVER_SERVER_ID_2 ULTRASERVER_NODE_ID_2 ULTRASERVER_SERVER_ID_4 ULTRASERVER_NODE_ID_4 ip-172-32-11-17.ap-southeast-4.compute.internal 1_2_4 u5wy80u0o2saugxy 0 bog79p1y8tetj5uu 0 ip-172-32-13-57.ap-southeast-4.compute.internal 1_2_4 u5wy80u0o2saugxy 1 bog79p1y8tetj5uu 1 ip-172-32-5-227.ap-southeast-4.compute.internal 1_2_4 ygml2651y0lwdd46 0 bog79p1y8tetj5uu 2 ip-172-32-9-4.ap-southeast-4.compute.internal 1_2_4 ygml2651y0lwdd46 1 bog79p1y8tetj5uu 3 Common issues ------------- Init script fails to start ~~~~~~~~~~~~~~~~~~~~~~~~~~ If at least one of the worker pods isn't scheduled to a node, the init script will fail to start. Example: .. code-block:: $ kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName' POD_NAME NODE_NAME demo-launcher-96xsl ip-172-32-9-4.ap-southeast-4.compute.internal demo-worker-0 demo-worker-1 demo-worker-2 demo-worker-3 $ kubectl logs demo-launcher-96xsl -c ultraserver-init-container Error from server (BadRequest): container "ultraserver-init-container" in pod "demo-launcher-96xsl" is waiting to start: PodInitializing Possible solution: Check your pods for affinity/scheduling issues. .. code-block:: $ kubectl describe pod demo-worker-0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 3m13s default-scheduler 0/4 nodes are available: 4 node(s) didn't match pod affinity rules. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling. Related Information ------------------- - :ref:`kubernetes-getting-started` - Information about how to use Neuron on EKS - :ref:`tutorials/k8s-neuron-device-plugin` - Information about Neuron Device Plugin - :ref:`aws-trn2-arch` - Information about trn2 UltraServer architecture - :ref:`general-troubleshooting` - Information about general troubleshooting for Neuron - `MPI Operator `_ - Information about MPI Operator - `MPI User Guide `_ - Information about MPI jobs - `Kubernetes Pod Affinity `_ - Information about pod affinity rules - `YAML anchors `_ - Information about YAML anchors ================================================ FILE: containers/index.rst ================================================ .. meta:: :description: AWS Neuron Deep Learning Containers (DLCs) are pre-configured Docker images for training and serving models on AWS Trainium and Inferentia instances with the Neuron SDK. 
:keywords: Neuron Containers, Deep Learning Containers, DLC, Docker, Kubernetes, EKS, ECS, AWS Neuron, Trainium, Inferentia, vLLM, Container Deployment :date-modified: 01/22/2026 .. _neuron_containers: Neuron Containers ================= This section contains the technical documentation for using AWS Neuron Deep Learning Containers (DLCs) and containerized deployments on Inferentia and Trainium instances. .. toctree:: :maxdepth: 1 :hidden: Getting Started Locate Neuron DLC Images Customize DLC Neuron Plugins Tutorials How-To Guides FAQ DRA Release Notes What are Neuron Deep Learning Containers? ------------------------------------------ AWS Neuron Deep Learning Containers (DLCs) are a set of pre-configured Docker images for training and serving models on AWS Trainium and Inferentia instances using the AWS Neuron SDK. Each DLC is optimized for specific ML frameworks and comes with all Neuron components pre-installed, enabling you to quickly deploy containerized workloads without manual setup. With Neuron DLCs, developers can: * Deploy production-ready containers with pre-installed Neuron SDK and ML frameworks * Use containers across multiple deployment platforms including EC2, EKS, ECS, and SageMaker * Customize DLCs to fit specific project requirements * Leverage Neuron plugins for better observability and fault tolerance * Run distributed training and inference workloads with vLLM integration * Schedule MPI jobs on Trn2 UltraServers for improved performance Neuron DLCs support popular ML frameworks including PyTorch, TensorFlow, and JAX, and are available for both training and inference workloads on Inf1, Inf2, Trn1, Trn1n, and Trn2 instances. .. admonition:: Neuron DRA for Kubernetes Neuron has released support for Dynamic Resource Allocation (DRA) with Kubernetes. :doc:`Read more about it here `. Quickstarts ----------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Quickstart: Deploy a DLC with vLLM :link: quickstart_vllm_dlc_deploy :link-type: ref :class-card: sd-rounded-3 Get started by configuring and deploying a Deep Learning Container with vLLM for inference. Time to complete: ~30 minutes. .. grid-item-card:: Quickstart: Build a Custom Neuron Container :link: containers-getting-started :link-type: ref :class-card: sd-rounded-3 Learn how to build a custom Neuron container using Docker for training or inference workloads. Neuron Containers Documentation -------------------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Getting Started :link: containers-getting-started :link-type: ref :class-card: sd-rounded-3 Step-by-step guide for building Neuron containers using Docker, including driver installation and container setup. .. grid-item-card:: Locate Neuron DLC Images :link: locate-neuron-dlc-image :link-type: ref :class-card: sd-rounded-3 Find the right pre-configured Deep Learning Container image for your ML framework and instance type. .. grid-item-card:: Customize Neuron DLC :link: containers-dlc-then-customize-devflow :link-type: ref :class-card: sd-rounded-3 Learn how to customize Neuron Deep Learning Containers to fit your specific project requirements. .. grid-item-card:: Neuron Plugins :link: neuron-container-plugins :link-type: ref :class-card: sd-rounded-3 Explore Neuron plugins for containerized environments, providing better observability and fault tolerance. .. 
grid-item-card:: Tutorials :link: /containers/tutorials :link-type: doc :class-card: sd-rounded-3 Hands-on tutorials for deploying containers on EC2, EKS, ECS, and other platforms with various configurations. .. grid-item-card:: How-To: Schedule MPI Jobs on UltraServers :link: containers-how-to-ultraserver :link-type: ref :class-card: sd-rounded-3 Learn how to schedule MPI jobs to run on Neuron UltraServers in EKS for improved performance. .. grid-item-card:: FAQ & Troubleshooting :link: container-faq :link-type: ref :class-card: sd-rounded-3 Frequently asked questions and solutions for common issues with Neuron containers. .. grid-item-card:: Neuron Containers Release Notes :link: /release-notes/components/containers :link-type: doc :class-card: sd-rounded-3 Review the latest updates, new DLC images, and improvements in Neuron container releases. ================================================ FILE: containers/k8.rst ================================================ .. _self-managed-kubernetes-service: Self Managed Kubernetes Service =============================== Introduction ------------ Use of Neuron in containers on a Kubernetes cluster can be simple to achieve by following :ref:`tutorial-k8s-env-setup-for-neuron` Known Limitations ----------------- Scheduling on k8s cluster requires contiguous neuron device-ids. Neuron provides a scheduler extension to solve this problem for self-managed k8 clusters. Read more about it here: :ref:`neuron-k8-scheduler-ext`. ================================================ FILE: containers/kubernetes-getting-started.rst ================================================ .. _kubernetes-getting-started: Using Neuron with Amazon EKS ============================= .. contents:: Table of Contents :local: :depth: 2 .. _tutorial-k8s-env-setup-for-neuron: EKS Setup for Neuron -------------------- Customers that use Kubernetes can conveniently integrate Inf/Trn instances into their workflows. This section provides step-by-step instructions for setting up an EKS cluster with Neuron support. Prerequisites ~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-prerequisite.rst Neuron Helm Chart ~~~~~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-neuron-helm-chart.rst .. _k8s-neuron-device-plugin: Neuron Device Plugin ~~~~~~~~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-neuron-device-plugin.rst .. _neuron_scheduler: Neuron Scheduler Extension ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-neuron-scheduler.rst Neuron Node Problem Detector and Recovery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst .. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst Neuron Monitor Daemonset ~~~~~~~~~~~~~~~~~~~~~~~~~ .. include:: /containers/tutorials/k8s-neuron-monitor.rst ================================================ FILE: containers/locate-neuron-dlc-image.rst ================================================ .. _locate-neuron-dlc-image: Neuron Deep Learning Containers =============================== .. contents:: Table of Contents :local: :depth: 2 Overview -------- AWS Deep Learning Containers (DLCs) provide a set of Docker images that are pre-installed with deep learning frameworks. The containers are optimized for performance and available in Amazon Elastic Container Registry (Amazon ECR). 
DLCs make it straightforward to deploy custom ML environments in a containerized manner, while taking advantage of the portability and reproducibility benefits of containers. AWS Neuron DLCs are a set of Docker images for training and serving models on AWS Trainium and Inferentia instances using AWS Neuron SDK. The sections below list all of the AWS Neuron DLCs, as well as the AWS DLCs that come pre-installed with the Neuron SDK. Inference Containers -------------------- .. list-table:: :widths: auto :header-rows: 1 :align: left :class: table-smaller-font-size * - DLC Name - DLC Link(s) - Tutorial(s) * - Neuron Inference Containers - | `Neuron PyTorch Inference Containers `_ | `Neuronx PyTorch Inference Containers `_ | `Neuronx PyTorch vLLM Inference Containers `_ - | :ref:`tutorial-infer` | :ref:`torchserve-neuron` | :ref:`quickstart_vllm_dlc_deploy` * - Large Model Inference (LMI)/Deep Java Library (DJL) Containers - `LMI Containers `_ - * - HuggingFace Inference Containers - | `HuggingFace Neuron Inference Containers `_ | `HuggingFace Neuron vLLM Containers `_ | `HuggingFace Text Generation Inference (TGI) Containers `_ - * - Triton Inference Containers - `NVIDIA Triton Inference Containers `_ - Training Containers ------------------- .. list-table:: :widths: auto :header-rows: 1 :align: left :class: table-smaller-font-size * - DLC Name - DLC Link(s) - Tutorial(s) * - Neuron Training Containers - | `Neuronx PyTorch Training Containers `_ | `Neuronx Jax Training Containers `_ - :ref:`tutorial-training` * - HuggingFace Training Containers - `HuggingFace Neuron Training Containers `_ - .. note:: Latest HuggingFace Neuron containers are also available on the `HuggingFace Optimum website `_. Getting started with Neuron DLC using Docker ---------------------------------------------- :ref:`containers-getting-started` Using containers on AWS services ---------------------------------- :ref:`Amazon EKS` ^^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`Amazon ECS` ^^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`Amazon SageMaker` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :ref:`AWS Batch` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Customizing Neuron Deep Learning Containers ------------------------------------------- Deep Learning Containers can be customized to fit your specific project needs. To read more, visit :ref:`containers-dlc-then-customize-devflow`. ================================================ FILE: containers/neo-then-hosting-devflow.rst ================================================ .. include:: /devflows/inference/neo-then-hosting-devflow.rst ================================================ FILE: containers/neuron-dra.rst ================================================ .. meta:: :description: AWS Neuron Dynamic Resource Allocation (DRA) for Kubernetes :keywords: AWS, Neuron, DRA, Kubernetes, Dynamic Resource Allocation .. _neuron-dra: ================================================= AWS Neuron Dynamic Resource Allocation (DRA) ================================================= What is DRA? ------------ Prior to Kubernetes 1.33, Kubernetes used device plugins for resource management. The Neuron device plugin implements the device plugin interface to allow Kubernetes scheduler to manage Neuron resources. However, the device plugin framework only tracks device count—the scheduler cannot see device attributes. Due to this limitation, the framework cannot natively facilitate attribute-based filtering during device selection. 
For example, the default Kubernetes scheduler prior to DRA cannot support allocation of connected devices without additional mechanisms such as a scheduler extension. Dynamic Resource Allocation (DRA) is a new framework for advanced resource management that addresses this limitation. DRA enables the scheduler to see the device attributes, allowing workloads to select devices based on specific attributes and achieve topology aware allocation. Hardware vendors determine which attributes are published for their hardware. The AWS Neuron DRA driver implements the kubelet plugin for DRA for AWS Trainium instances. For more information on DRA, refer to `Kubernetes Dynamic Resource Allocation `_. Where can I get the Neuron DRA driver and resource templates? ------------------------------------------------------------------- To review and download the individual resource claim templates, visit this page: * :doc:`/containers/files/index-dra`. What are the benefits of using DRA over device plugin? ------------------------------------------------------- **Reduced developer complexity** Device plugin-based workloads use node labels along with request and limits to allocate right resources. Example: .. code-block:: yaml Worker: replicas: 4 template: spec: containers: - image: .dkr.ecr.us-west-2.amazonaws.com/neuronx_nemo:latest name: mpitest imagePullPolicy: Always resources: limits: aws.amazon.com/neuron: "16" vpc.amazonaws.com/efa: "16" requests: aws.amazon.com/neuron: "16" vpc.amazonaws.com/efa: "16" volumeMounts: - name: dshm mountPath: /dev/shm volumes: - name: dshm emptyDir: medium: Memory DRA introduces ``ResourceClaim`` and ``ResourceClaimTemplates`` which provide abstraction: .. code-block:: yaml Worker: replicas: 4 template: spec: containers: - image: .dkr.ecr.us-west-2.amazonaws.com/neuronx_nemo:latest name: mpitest imagePullPolicy: Always resources: claims: - name: neurons volumeMounts: - name: dshm mountPath: /dev/shm volumes: - name: dshm emptyDir: medium: Memory resourceClaims: - name: neurons resourceClaimTemplateName: efa-neurons-4-devices The ``ResourceClaimTemplate`` name is a given name and can be defined by the ML infra operators to be friendly to their developers. The RCT definition translates the name into the underlying allocation details - these are abstracted away from ML developers. **Rich interface for resource requests** With DRA, resource requests can specify attribute-based selection. For example, RCT can follow requests, which was not possible to do with device plugins without additional node labeling and extensions. This interface allows us to facilitate topology-aware scheduling. * Allocate connected neuron devices from trn2 instance type and the devices in the set need to be running specified Neuron driver version. * Allocate a specific set of neuron devices for my pod - I want the pod to use devices in row 1 of the topology. **Dynamic configuration** DRA allows end users to specify additional configuration for the device via RCT. The Neuron DRA driver leverages this capability to allow ResourceClaimTemplates to specify LNC size to be used for the allocation. An example is shown below. The end user need not configure LNC via launch template while using Neuron devices with Neuron DRA driver. .. 
code-block:: yaml #Template will be vended by Neuron via documentation/code repo apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: namespace: neuron-test7 name: lnc-neurons spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: device.attributes['neuron.aws.com'].instanceType == "trn2.48xlarge" allocationMode: all config: - opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: NeuronConfig logicalNeuronCore: 1 requests: ["neurons"] Prerequisites ----------------------------- * **Kubernetes version** - Please use K8s control plane 1.34+ * **Instance type** - Trn2.48xlarge launched with K8s version 1.34.2+ For instructions on how to setup an EKS cluster, please refer to :ref:`prerequisites`. Installation via Helm --------------------- Connect to your cluster from local box. The cluster should have at least one trn2.48xlarge node. Do not install the Neuron device plugin on the cluster! Please confirm the cluster being used via: .. code-block:: bash kubectl config current-context Then install the DRA driver: .. code-block:: bash helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \ --set "devicePlugin.enabled=false" --set "npd.enabled=false" --set "draDriver.enabled=true" Example 1 – Connected Neuron Devices -------------------------------------- This section will demonstrate how to run a workload that needs to request a subset of connected Neuron Devices from a trn2.48xlarge instance. Before DRA, this use case required using Neuron Scheduler Extension. With DRA, this allocation is enabled natively. * [:download:`Download example YAML file `] The supported subsets include set of 1, 4, 8 or 16. Specifically, these are ``resource.aws.com/devicegroup1_id``, ``resource.aws.com/devicegroup4_id``, ``resource.aws.com/devicegroup8_id``, ``resource.aws.com/devicegroup16_id`` respectively. The sets of 4 and 8 are selected as shown in diagram below: .. image:: /containers/images/neuron-dra-connected-devices.jpeg :alt: Connected Neuron Devices :width: 600px To enable a workload to consume a connected subset of Neuron Devices, first create a ``ResourceClaimTemplate`` that requests a connected set of Neuron devices. From the package run: .. code-block:: bash kubectl apply -f specs/1x4-connected-devices.yaml This workload definition (which includes the ``ResourceClaimTemplate``) is shown below for quick reference: .. code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: 1x4-connected-neurons spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com allocationMode: ExactCount count: 4 selectors: - cel: expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'" constraints: - requests: ["neurons"] matchAttribute: "resource.aws.com/devicegroup4_id" Next step is to reference the ``ResourceClaimTemplate`` in a pod definition as shown below: .. code-block:: yaml --- apiVersion: v1 kind: Pod metadata: name: pod0 labels: app: pod spec: containers: - name: ctr0 image: public.ecr.aws/ubuntu/ubuntu:22.04 command: ["bash", "-c"] args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"] resources: claims: - name: neurons resourceClaims: - name: neurons resourceClaimTemplateName: 1x4-connected-neurons Deploy the above workload using ``kubectl apply``. When the pod is running, examine the related ``ResourceClaim`` using: .. 
code-block:: bash kubectl get resourceclaim -o yaml The ``resourceclaim`` output will show the 4 Neuron Devices that were allocated to the pod. An example is shown below. These will be connected Neuron Devices. .. code-block:: bash [devbox]$ kubectl get pod NAME READY STATUS RESTARTS AGE --------------------------------------- pod0 1/1 Running 0 3s [devbox]$ kubectl get resourceclaim NAME STATE AGE --------------------------------------------- pod0-neurons-zdk76 allocated,reserved 9s [devbox]$ kubectl get resourceclaim pod0-neurons-zdk76 -o yaml Status shown below: .. code-block:: yaml status: allocation: devices: results: - adminAccess: null device: neurondevice2 driver: neuron.aws.com pool: ip-1-1-1-1.region.compute.internal request: neurons - adminAccess: null device: neurondevice3 driver: neuron.aws.com pool: ip-1-1-1-1.region.compute.internal request: neurons - adminAccess: null device: neurondevice1 driver: neuron.aws.com pool: ip-1-1-1-1.region.compute.internal request: neurons - adminAccess: null device: neurondevice0 driver: neuron.aws.com pool: ip-1-1-1-1.region.compute.internal request: neurons .. note:: The RCT name can be simplified to communicate the intent of the allocation and abstract the allocation details away from ML developers. **Example RCT1 - "xl" - Allocate All 16 devices** .. code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: xl-trn2 spec: spec: devices: requests: - name: neurons exactly: allocationMode: ExactCount count: 16 deviceClassName: neuron.aws.com selectors: - cel: expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge' **Example RCT2 - large - Allocate 8 devices** .. code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: l-trn2 spec: spec: devices: constraints: - matchAttribute: resource.aws.com/devicegroup8_id requests: - neurons requests: - name: neurons exactly: allocationMode: ExactCount count: 8 deviceClassName: neuron.aws.com selectors: - cel: expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge' **Example RCT2 - 2.27-driver – Allocate 8 devices with driver version at the driver published by Neuron SDK 2.27** `Neuron 2.27.0 Runtime `_ .. code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: 2.27-driver-trn2 spec: spec: devices: constraints: - matchAttribute: resource.aws.com/devicegroup8_id requests: - neurons requests: - name: neurons exactly: allocationMode: ExactCount count: 8 deviceClassName: neuron.aws.com selectors: - cel: expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge' && device.attributes['neuron.aws.com'].neuronDriverVersion == '2.25.4.0' Example 2 - Dynamic LNC config ------------------------------ This example shows how to set LNC per workload. Earlier, overriding LNC on a Node required a node template. With DRA, workloads can override default LNC via ``ResourceClaim.`` * [:download:`Download example YAML file `] Apply the following workload definition: .. code-block:: bash kubectl apply -f specs/lnc-setting-trn2.yaml This workload definition (which includes the ``ResourceClaimTemplate``) is shown below for quick reference: .. 
code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: all-neurons-lnc-1 spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'" allocationMode: All config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: NeuronConfig logicalNeuronCore: 1 Then deploy a pod that references the above ``ResourceClaimTemplate`` as shown below: .. code-block:: yaml apiVersion: v1 kind: Pod metadata: name: pod0 labels: app: pod spec: containers: - name: ctr0 image: public.ecr.aws/ubuntu/ubuntu:22.04 command: ["bash", "-c"] args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"] resources: claims: - name: neurons resourceClaims: - name: neurons resourceClaimTemplateName: all-neurons-lnc-1 Example 3 – Four Node Inference on trn2u.48xlarge -------------------------------------------------- A trn2u.48xlarge Trn2 UltraServer has 4 Trn2 nodes interconnected by Neuron Links. trn2u.48xlarge instances can be allocated in set of 1, 2, or 4. The Neuron DRA driver can utilize 1 or more ``ResourceClaimTemplate`` definitions to convey the desired size of the set. The ``ResourceClaimTemplate`` allows end users to specify "UltraServerConfig" to declare their intent to use all 4 nodes of the UltraServer. This configuration value is passed by the Neuron DRA driver to the Neuron runtime and collectives inside the container. * [:download:`Download example YAML file `] Example yaml for 4-node inference on trn2u.48xlarge: .. code-block:: yaml apiVersion: resource.k8s.io/v1 kind: ResourceClaimTemplate metadata: name: us-4-node-config spec: spec: devices: requests: - name: neurons exactly: deviceClassName: neuron.aws.com selectors: - cel: expression: "device.attributes['neuron.aws.com'].resourceType == 'neuron_node'" allocationMode: ExactCount count: 1 config: - requests: ["neurons"] opaque: driver: neuron.aws.com parameters: apiVersion: neuron.aws.com/v1 kind: UltraServerConfig ultraserverMode: 4 --- apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: name: vllm annotations: leaderworkerset.sigs.k8s.io/exclusive-topology: neuron.amazonaws.com/ultraserver-server-id-4 spec: rolloutStrategy: type: RollingUpdate rollingUpdateConfiguration: maxUnavailable: 1 maxSurge: 1 # Two replica groups of 4 nodes each, i.e. two ultraservers replicas: 2 leaderWorkerTemplate: size: 4 restartPolicy: RecreateGroupOnPodRestart leaderTemplate: metadata: labels: role: leader spec: containers: - name: vllm-leader image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-4-node-config workerTemplate: metadata: labels: role: worker spec: containers: - name: vllm-worker image: public.ecr.aws/ubuntu/ubuntu:22.04 command: - sh - -c - "sleep infinity" resources: claims: - name: one-node-from-ultraserver resourceClaims: - name: one-node-from-ultraserver resourceClaimTemplateName: us-4-node-config Neuron DRA Driver Attributes Reference --------------------------------------- The Neuron DRA driver publishes the following attributes in resource slices. These attributes can be used in ``ResourceClaimTemplate`` CEL expressions to filter and select specific devices for allocation. 
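As a concrete illustration, here is a hedged sketch that combines two of these attributes: a CEL selector pins the instance type and driver version (the version string is only an example), while a ``matchAttribute`` constraint keeps all four allocated devices in the same topology group of 4:

.. code-block:: yaml

   apiVersion: resource.k8s.io/v1
   kind: ResourceClaimTemplate
   metadata:
     name: 1x4-connected-pinned-driver   # illustrative name
   spec:
     spec:
       devices:
         requests:
           - name: neurons
             exactly:
               deviceClassName: neuron.aws.com
               allocationMode: ExactCount
               count: 4
               selectors:
                 - cel:
                     # Filter on two published attributes at once.
                     expression: >-
                       device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge' &&
                       device.attributes['neuron.aws.com'].neuronDriverVersion == '2.25.4.0'
         constraints:
           # All four devices must share the same devicegroup4_id hash.
           - requests: ["neurons"]
             matchAttribute: "resource.aws.com/devicegroup4_id"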
Common Attributes ^^^^^^^^^^^^^^^^^ These attributes are common to all Neuron instances and their devices: * ``deviceId`` - An integer value representing the ID of the Neuron device. Used to identify which device is chosen from allocation. * ``instanceType`` - A string value representing the EC2 instance type of the Neuron device. Used to specify devices of which instance(s) to choose for allocation. * ``neuronDriverVersion`` - A string value representing the Neuron driver version running on the instance. Used to claim instances with the same driver version for allocation. * ``draDriverVersion`` - A version value of the Neuron DRA driver version. Provides visibility on which Neuron DRA driver version published the resource slice. * ``resourceType`` - A string value to distinguish between devices and UltraServer nodes. For devices, this value is ``neuron_device``. For UltraServers, this value is ``neuron_node``. * ``networkNodeLayer1`` - A string value representing network node layer 1. Can be used during topology-aware scheduling to minimize network latency and optimize instance placement. See `EC2 Instance Topology `_. * ``networkNodeLayer2`` - A string value representing network node layer 2. Can be used to allocate workloads to nodes on the same spine. See `EC2 Instance Topology `_. * ``networkNodeLayer3`` - A string value representing network node layer 3. Can be used during topology-aware scheduling to minimize network latency and optimize instance placement. See `EC2 Instance Topology `_. Trn Non-UltraServer Attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ These attributes are only populated for Neuron instances that have grid topology (trn) and are not UltraServers: * ``topology_x`` - An integer value representing the row of the device in a grid topology. Only populated when the number of devices in the instance is greater than 1. Can be used to select a specific device or devices that belong to the same row. * ``topology_y`` - An integer value representing the column of the device in a grid topology. Only populated when the number of devices in the instance is greater than 1. Can be used to select a specific device or devices that belong to the same column. * ``topology4_id`` - An integer value representing the row of the device in a grid topology. Only populated when the number of devices in the instance is greater than 1. Can be used to select devices that belong to the same row. * ``topology8_id`` - An integer value representing the row of the device in a grid topology. Only populated when the number of devices in the instance is greater than or equal to 8. Can be used to select devices that belong to the same two rows. Trn UltraServer Attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^ These attributes are only populated for Neuron instances that have grid topology (trn) and are UltraServers: * ``capacityBlockId`` - A string value representing the ID of the capacity block that the UltraServer instance is in. See `Instance Topology API `_. EFA-Enabled Instance Attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ These attributes are only populated for Neuron instances that are EFA-enabled: * ``resource.aws.com/devicegroup1_id`` - A string value representing the EFA Bus:Device:Function (BDF) corresponding to that device. * ``resource.aws.com/devicegroup4_id`` - A string value representing a hash, ensuring Neuron devices in the same topology group of 4 get the same group ID. 
* ``resource.aws.com/devicegroup8_id`` - A string value representing a hash, ensuring Neuron devices in the same topology group of 8 get the same group ID. * ``resource.aws.com/devicegroup16_id`` - A string value representing a hash, ensuring Neuron devices in the same topology group of 16 get the same group ID. FAQs ---- Can DRA plugin co-exist with other device plugins? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Device plugins and the DRA plugin can coexist in the same cluster, but **not** for the same node. As of now, the two mechanisms act independently. Neuron is preparing an upcoming feature that will allow device plugin based allocations to work with DRA, but the feature is still in alpha and not enabled on EKS. Ref: `Extended Resource `_. Is DRA replacing Neuron Device Plugin and Scheduler Extension? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We will continue to support the Neuron Device Plugin and Scheduler Extension as long as: 1. Upstream Kubernetes continues to support device plugins. 2. EKS continues to support Kubernetes versions below 1.34 (which do not support DRA). What Kubernetes versions are supported? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Kubernetes control plane must be on 1.34. For Node AMI, we support 1.34.2+. We do not support Node AMI for 1.34.0 or 1.34.1 since it had a regression in DRA. Upstream issue: `Kubernetes Issue #133920 `_ Where can I learn more about how to put together RCT using CEL expressions? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To learn more about RCTs, please visit `Kubernetes Dynamic Resource Allocation `_. To learn more about CEL expressions, please visit `CEL Language `_. Send us feedback and let us know which additional RCT examples you would like us to provide in the source code. .. toctree:: :maxdepth: 1 :hidden: Support Files ================================================ FILE: containers/neuron-plugins.rst ================================================ .. _neuron-container-plugins: Neuron Plugins for Containerized Environments ============================================= This section provides an overview of the Neuron infrastructure components for containerized environments. For detailed setup instructions, see :ref:`tutorial-k8s-env-setup-for-neuron`. Neuron Device Plugin -------------------- Exposes Neuron hardware resources to Kubernetes as schedulable resources (``aws.amazon.com/neuron`` and ``aws.amazon.com/neuroncore``). The device plugin discovers Neuron devices on each node, advertises them to the scheduler, and manages allocation to Pods with exclusive access. Neuron Scheduler Extension --------------------------- Provides topology-aware scheduling for optimal Neuron device allocation. It considers device connectivity and placement to ensure efficient utilization. This component is optional and most beneficial for workloads requesting specific subsets of Neuron devices or cores. Neuron Node Problem Detector and Recovery ------------------------------------------ Monitors Neuron device health and detects hardware and software errors. When unrecoverable issues occur, it can mark nodes as unhealthy and trigger node replacement. It also publishes CloudWatch metrics under the ``NeuronHealthCheck`` namespace for monitoring. For ECS environments, see :ref:`ecs-neuron-problem-detector-and-recovery`. Neuron Monitor -------------- Collects and exposes metrics from Neuron devices including hardware utilization, performance counters, memory usage, and device health. 
Supports integration with observability platforms like Prometheus for monitoring and alerting.

Neuron Dynamic Resource Allocation (DRA) Driver
-----------------------------------------------

Manages Neuron hardware resources in a Kubernetes environment. It integrates with the Kubernetes Dynamic Resource Allocation (DRA) framework to advertise Neuron devices and their attributes. This feature cannot be used alongside the Neuron device plugin on the same nodes of a cluster. For more information on the Neuron DRA driver, refer to :ref:`neuron-dra`

================================================
FILE: containers/neuron_dlc_images.csv
================================================

Framework,Neuron Package,Job Type,Supported EC2 Instance Types,Python Version Options,ECR Public Repo URL,Image Details,Other Packages
PyTorch 2.1.2,"aws-neuronx-tools, neuronx_distributed, torch-neuronx, transformers-neuronx",inference,trn1 and inf2,3.10 (py310),https://gallery.ecr.aws/neuron/pytorch-inference-neuronx,https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuronx,torchserve
PyTorch 2.1.2,"aws-neuronx-tools, neuronx_distributed, torch-neuronx",training,trn1 and inf2,3.10 (py310),https://gallery.ecr.aws/neuron/pytorch-training-neuronx,https://github.com/aws-neuron/deep-learning-containers#pytorch-training-neuronx,
PyTorch 1.13.1,"aws-neuronx-tools, torch-neuron",inference,inf1,3.10 (py310),https://gallery.ecr.aws/neuron/pytorch-inference-neuron,https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuron,torchserve
PyTorch 1.13.1,"aws-neuronx-tools, neuronx_distributed, torch-neuronx, transformers-neuronx",inference,trn1 and inf2,3.10 (py310),https://gallery.ecr.aws/neuron/pytorch-inference-neuronx,https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuronx,torchserve
PyTorch 1.13.1,"aws-neuronx-tools, neuronx_distributed, torch-neuronx",training,trn1 and inf2,3.10 (py310),https://gallery.ecr.aws/neuron/pytorch-training-neuronx,https://github.com/aws-neuron/deep-learning-containers#pytorch-training-neuronx,

================================================
FILE: containers/troubleshooting.rst
================================================

.. _container-troubleshooting:

Troubleshooting Neuron Containers
=================================

This document aims to provide more information on how to fix issues you might encounter while using the Neuron Containers. For each issue, we provide an explanation of what happened and what can potentially correct the issue.

If your issue is not listed below or you have a more nuanced problem, contact us via `issues `__ posted to this repo, the `AWS Neuron developer forum `__, or through AWS support.

The Neuron Container includes the following Neuron components. For issues relating to these components inside the container, refer to the individual component troubleshooting guides: :ref:`general-troubleshooting`

* Neuron Runtime/Driver
* PyTorch/TensorFlow/MXNet frameworks
* Libfabric/EFA

The following are container-specific issues.

Neuron Device Not Found
-----------------------

The Neuron container expects the Neuron devices to be exposed to the container as referenced in :ref:`container-devices`.
Please look at the container logs for messages like the one below:

::

   2022-Sep-08 17:55:23.0768 19:19 ERROR TDRV:tdrv_get_dev_info No neuron device available

If the above message is seen, the devices are not exposed to the container.

Solution
''''''''

* Refer to :ref:`container-devices` and make sure the devices are exposed to the container.
* If specific cores are being used, refer to :ref:`container-cores` and make sure the cores are exposed to the container.
* In a Kubernetes environment, refer to :ref:`k8s-specify-devices` or :ref:`k8s-specify-cores` to make sure Neuron devices/cores are present in the pod's container spec.

Contiguous Device IDs
---------------------

The Neuron runtime expects the Inferentia/Trainium device IDs to be contiguous. If the device IDs are not contiguous, you might see error messages like the ones below:

::

   2022-Sep-08 21:52:11.0307 7:7 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd1

::

   2022-Sep-08 23:00:05.0667 8:8 ERROR NRT:nrt_allocate_neuron_cores Neuron cores are not contiguous

Solution
''''''''

* In the docker run command, make sure the devices specified using ``--device`` are all contiguous (see the example below).
* If the OCI neuron hook is used with the environment variable ``AWS_NEURON_VISIBLE_DEVICES``, make sure the devices specified are all contiguous.
* In a Kubernetes environment with just the Neuron device plugin running, there is no guarantee that the devices allocated will be contiguous. Make sure to run the Neuron scheduler extension as specified in :ref:`neuron-k8-scheduler-ext`.
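To make the contiguity requirement concrete, the sketch below contrasts a valid and an invalid device set when passing devices to ``docker run``. The image name is a placeholder for any image built as described in the tutorials that follow.

.. code:: bash

   # Contiguous device IDs (1 and 2): the runtime can allocate across both devices.
   docker run --device=/dev/neuron1 --device=/dev/neuron2 my-neuron-image neuron-ls

   # Non-contiguous device IDs (1 and 3): the runtime may fail with the
   # "Neuron cores are not contiguous" error shown above.
   # docker run --device=/dev/neuron1 --device=/dev/neuron3 my-neuron-image neuron-ls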
================================================
FILE: containers/tutorial-docker-runtime1.0.rst
================================================

.. _tutorial-docker-environment-setup-for-neuron-runtime-10:

Tutorial: Docker environment setup for Neuron Runtime 1.x
=========================================================

Introduction
------------

A Neuron application can be deployed using docker containers. This tutorial describes how to configure docker to expose Inferentia devices to containers.

Once the environment is set up, a container can be started with the *AWS_NEURON_VISIBLE_DEVICES* environment variable to specify the desired set of Inferentia devices to be exposed to the container. AWS_NEURON_VISIBLE_DEVICES is a set of contiguous, comma-separated Inferentia logical IDs. To find out the available logical IDs on your instance, run the neuron-ls tool. For example, on an inf1.6xlarge instance with 4 Inferentia devices, you may set AWS_NEURON_VISIBLE_DEVICES="2,3" to expose the last two devices to a container. When running neuron-ls inside a container, you will only see the set of exposed Inferentia devices. For example:

.. code:: bash

   docker run --env AWS_NEURON_VISIBLE_DEVICES="0" neuron-test neuron-ls

Would produce the following output:

::

   +--------------+---------+--------+-----------+-----------+------+------+
   | PCI BDF      | LOGICAL | NEURON | MEMORY    | MEMORY    | EAST | WEST |
   |              | ID      | CORES  | CHANNEL 0 | CHANNEL 1 |      |      |
   +--------------+---------+--------+-----------+-----------+------+------+
   | 0000:00:1f.0 | 0       | 4      | 4096 MB   | 4096 MB   | 0    | 0    |
   +--------------+---------+--------+-----------+-----------+------+------+

Steps:
------

This tutorial starts from a fresh Ubuntu Server 16.04 LTS AMI "ami-08bc77a2c7eb2b1da".

Step 1: install the aws-neuron-runtime-base package
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the :ref:`install-guide-index` to set up access to the Neuron repos. Then, install the aws-neuron-runtime-base package.

.. code:: bash

   sudo apt-get install aws-neuron-runtime-base

Step 2: Make sure that the neuron-rtd service is not running
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If neuron-rtd is running on the host, stop the neuron-rtd service before starting the containerized neuron-rtd. This is needed to allow assignment of devices to containers:

.. code:: bash

   sudo service neuron-rtd stop

Step 3: install the oci-add-hooks dependency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`oci-add-hooks `__ is an OCI runtime whose sole purpose is injecting OCI prestart, poststart, and poststop hooks into a container config.json before passing it along to an OCI compatible runtime. oci-add-hooks is used to inject a hook that exposes Inferentia devices to the container.

.. code:: bash

   sudo apt install -y golang && \
       export GOPATH=$HOME/go && \
       go get github.com/joeshaw/json-lossless && \
       cd /tmp/ && \
       git clone https://github.com/awslabs/oci-add-hooks && \
       cd /tmp/oci-add-hooks && \
       make build && \
       sudo cp /tmp/oci-add-hooks/oci-add-hooks /usr/local/bin/

.. _step-4-setup-docker-to-use-oci-neuron-oci-runtime:

Step 4: set up Docker to use the oci-neuron OCI runtime
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

oci-neuron is a script representing an OCI compatible runtime. It wraps oci-add-hooks, which wraps runc. In this step, we configure docker to point at the oci-neuron OCI runtime. Install docker.io:

.. code:: bash

   sudo apt install -y docker.io
   sudo usermod -aG docker $USER

Log out and log back in to refresh group membership. Place the daemon.json Docker configuration file supplied by the Neuron SDK in the default location. This file specifies oci-neuron as the default docker runtime:

.. code:: bash

   sudo cp /opt/aws/neuron/share/docker-daemon.json /etc/docker/daemon.json
   sudo service docker restart

If the docker restart command fails, check whether the docker systemd service is masked. More information on this can be found here: https://stackoverflow.com/a/37640824

Verify docker:

.. code:: bash

   docker run hello-world

Expected result:

::

   Hello from Docker!
   This message shows that your installation appears to be working correctly.

   To generate this message, Docker took the following steps:
   1. The Docker client contacted the Docker daemon.
   2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
   3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
   4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

   To try something more ambitious, you can run an Ubuntu container with:
   $ docker run -it ubuntu bash

   Share images, automate workflows, and more with a free Docker ID:
   https://hub.docker.com/

   For more examples and ideas, visit:
   https://docs.docker.com/get-started/

Build a docker image using the provided dockerfile :ref:`neuron-runtime-dockerfile`, and use it to verify device whitelisting:

.. code:: bash

   docker build . -f Dockerfile.neuron-rtd -t neuron-test

Then run:
.. code:: bash

   docker run --env AWS_NEURON_VISIBLE_DEVICES="0" neuron-test neuron-ls

Expected result:

::

   +--------------+---------+--------+-----------+-----------+------+------+
   | PCI BDF      | LOGICAL | NEURON | MEMORY    | MEMORY    | EAST | WEST |
   |              | ID      | CORES  | CHANNEL 0 | CHANNEL 1 |      |      |
   +--------------+---------+--------+-----------+-----------+------+------+
   | 0000:00:1f.0 | 0       | 4      | 4096 MB   | 4096 MB   | 0    | 0    |
   +--------------+---------+--------+-----------+-----------+------+------+

================================================
FILE: containers/tutorials/build-run-neuron-container.rst
================================================

.. _how-to-build-neuron-container:

Tutorial: How to Build and Run a Neuron Container
=================================================

Introduction
------------

This document explains how to build a Neuron container using an existing Dockerfile.

Pre-requisites
--------------

#. Docker version 18 or newer, configured according to :ref:`tutorial-docker-env-setup`
#. An Inf1/Trn1 instance with available :ref:`Neuron Devices`
#. If running a serving application such as tensorflow-model-server, torchserve, or multi-model-server, make sure the appropriate ports that the server listens on are exposed, using EXPOSE in the Dockerfile or the argument ``-p 80:8080`` on the ``docker run`` command.

.. _running-application-container:

Build and Run the Application Container
---------------------------------------

Follow the steps below to create Neuron application containers.

- Build a docker image using the provided dockerfile: :ref:`libmode-dockerfile` for Inf1, or :ref:`trainium-dlc-dockerfile` for Trn1 (for Trn1, the dockerfile also needs the mlp train script found at :ref:`mlp-train`).

.. code:: bash

   docker build . -f Dockerfile.pt -t neuron-container:pytorch

- Run the container locally:

.. code:: bash

   docker run -it --name pt17 --device=/dev/neuron0 neuron-container:pytorch neuron-ls

Expected result for Inf1:

::

   +--------------+---------+--------+-----------+-----------+------+------+
   | PCI BDF      | LOGICAL | NEURON | MEMORY    | MEMORY    | EAST | WEST |
   |              | ID      | CORES  | CHANNEL 0 | CHANNEL 1 |      |      |
   +--------------+---------+--------+-----------+-----------+------+------+
   | 0000:00:1f.0 | 0       | 4      | 4096 MB   | 4096 MB   | 0    | 0    |
   +--------------+---------+--------+-----------+-----------+------+------+

Expected result for Trn1:

::

   +--------+--------+--------+-----------+---------+
   | NEURON | NEURON | NEURON | CONNECTED | PCI     |
   | DEVICE | CORES  | MEMORY | DEVICES   | BDF     |
   +--------+--------+--------+-----------+---------+
   | 0      | 4      | 8 GB   | 1         | 00:1f.0 |
   +--------+--------+--------+-----------+---------+

.. note::

   If, instead of the --device option above, the env variable AWS_NEURON_VISIBLE_DEVICES is to be used, then the OCI hook needs to be installed by following the instructions in :ref:`tutorial-oci-hook`.

Important to know
-----------------

.. _container-devices:

Devices
^^^^^^^

- The docker native way is to use --device /dev/neuron# for each of the Neuron devices intended to be passed. When using the --device option, ALL/all is not supported.

.. code:: bash

   docker run --device=/dev/neuron0 --device=/dev/neuron1

- If you install the aws-neuronx-oci-hook package, you will have an OCI hook that also supports use of a container environment variable, AWS_NEURON_VISIBLE_DEVICES=, which is intended to make things easier for multi-device scenarios. Following are some examples. For setting up the OCI hook, please refer to :ref:`oci neuron hook `.
.. code:: bash

   docker run -e "AWS_NEURON_VISIBLE_DEVICES=0,1"
   docker run -e "AWS_NEURON_VISIBLE_DEVICES=ALL"

- In a Kubernetes environment, the Neuron device plugin is used for exposing Neuron devices to the containers in the pod. The number of devices can be adjusted using the *aws.amazon.com/neuron* resource in the pod specification. Refer to :ref:`K8s setup ` for more details.

.. code:: bash

   resources:
     limits:
       aws.amazon.com/neuron: 1

.. note::

   Only the number of devices can be specified. When only the Neuron device plugin is running, the devices allocated are not guaranteed to be contiguous. Make sure to run the Neuron scheduler extension (:ref:`neuron-k8-scheduler-ext`), which ensures that contiguous devices are allocated to the containers.

- Multiple container applications running on the same host can share the devices, but the cores cannot be shared. This is similar to running multiple applications on the host.
- In the Kubernetes environment, the devices cannot be shared by multiple containers in the pod.

.. _container-cores:

Cores
^^^^^

Each Neuron device has multiple cores. The cores allocated to a process/container can be controlled by the environment variables NEURON_RT_VISIBLE_CORES and NEURON_RT_NUM_CORES. Please refer to :ref:`nrt-configuration` for more details.

- The docker native way is to use --device /dev/neuron# for each of the Neuron devices intended to be passed. Add --env NEURON_RT_VISIBLE_CORES=1,2 to let this container use cores 1 and 2. For example, on an inf1.24xlarge with 64 cores, if we want to use cores 51 and 52, the appropriate devices and NEURON_RT_VISIBLE_CORES need to be used. With 4 cores in each device, core 51 is in device 12 and core 52 is in device 13, as the sketch after this list illustrates.

.. code:: bash

   docker run --device=/dev/neuron12 --device=/dev/neuron13 --env NEURON_RT_VISIBLE_CORES=51,52

- In a Kubernetes environment, the Neuron device plugin is used for exposing Neuron cores to the containers in the pod. The number of cores can be adjusted using the *aws.amazon.com/neuroncore* resource in the pod specification. Refer to :ref:`K8s setup ` for more details.

.. code:: bash

   resources:
     limits:
       aws.amazon.com/neuroncore: 1

.. note::

   Only the number of cores can be specified. When only the Neuron device plugin is running, the cores allocated are not guaranteed to be contiguous. Make sure to run the Neuron scheduler extension (:ref:`neuron-k8-scheduler-ext`), which ensures that contiguous cores are allocated to the containers.

- Multiple container applications running on the same host cannot share the cores. This is similar to running multiple applications on the host.
- In the Kubernetes environment, the cores cannot be shared by multiple containers in the pod.
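The core-to-device mapping in the example above is simple integer division. A minimal sketch, assuming 4 cores per device as on inf1.24xlarge:

.. code:: bash

   # Map NeuronCore indices to the Neuron device that owns them.
   CORES_PER_DEVICE=4
   for core in 51 52; do
       echo "core $core -> /dev/neuron$((core / CORES_PER_DEVICE))"
   done
   # Prints:
   #   core 51 -> /dev/neuron12
   #   core 52 -> /dev/neuron13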
================================================
FILE: containers/tutorials/inference/index.rst
================================================

Containers -- Inference Tutorials
=================================

.. toctree::
   :maxdepth: 1
   :hidden:

   /containers/tutorials/inference/tutorial-infer
   /containers/tutorials/inference/k8s_rn50_demo

.. include:: /containers/tutorials/inference/index.txt

================================================
FILE: containers/tutorials/inference/index.txt
================================================

* :ref:`tutorial-infer`
* :ref:`example-deploy-rn50-as-k8s-service`

================================================
FILE: containers/tutorials/inference/k8s_rn50_demo.rst
================================================

.. _example-deploy-rn50-as-k8s-service:

Deploy a TensorFlow ResNet50 model as a Kubernetes service
----------------------------------------------------------

This tutorial uses the ResNet50 model as a teaching example of how to deploy an inference application using Kubernetes on Inf1 instances.

Prerequisites:
^^^^^^^^^^^^^^

- Please follow the instructions at :ref:`tutorial-k8s-env-setup-for-neuron` to set up k8s support on your cluster.
- Inf1 instances as worker nodes with attached roles allowing:

  - ECR read access policy to retrieve container images from ECR: **arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly**
  - S3 access to retrieve the saved_model from within the TensorFlow Serving container.

Deploy a TensorFlow Serving application image
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A trained model must be compiled to an Inferentia target before it can be deployed on Inferentia instances. To continue, you will need a Neuron-optimized TensorFlow model saved in Amazon S3. If you don't already have a SavedModel, please follow the tutorial for `creating a Neuron compatible ResNet50 model `_ and upload the resulting SavedModel to S3. ResNet-50 is a popular machine learning model used for image classification tasks. For more information about compiling Neuron models, see `The AWS Inferentia Chip With DLAMI `_ in the AWS Deep Learning AMI Developer Guide.

The sample deployment manifest manages a pre-built inference serving container for TensorFlow provided by AWS Deep Learning Containers. Inside the container are the AWS Neuron Runtime and the TensorFlow Serving application. A complete list of pre-built Deep Learning Containers optimized for Neuron is maintained on GitHub under `Available Images `_. At start-up, the DLC will fetch your model from Amazon S3, launch Neuron TensorFlow Serving with the saved model, and wait for prediction requests. The number of Neuron devices allocated to your serving application can be adjusted by changing the `aws.amazon.com/neuron` resource in the deployment yaml. Please note that communication between TensorFlow Serving and the Neuron runtime happens over GRPC, which requires passing the `IPC_LOCK` capability to the container.

1. Create a file named `rn50_deployment.yaml` with the contents below. Update the region-code and model path to match your desired settings. The model name is for identification purposes when a client makes a request to the TensorFlow server. This example uses a model name to match a sample ResNet50 client script that will be used in a later step for sending prediction requests.

.. note::

   1. Replace the s3 bucket name in the model_base_path arg in the file with the S3 location where the saved model was stored.
   2. In the image: add the appropriate location of the DLC tensorflow image.

::

   kind: Deployment
   apiVersion: apps/v1
   metadata:
     name: k8s-neuron-test
     labels:
       app: k8s-neuron-test
       role: master
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: k8s-neuron-test
         role: master
     template:
       metadata:
         labels:
           app: k8s-neuron-test
           role: master
       spec:
         containers:
           - name: k8s-neuron-test
             image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-neuron:1.15.4-neuron-py37-ubuntu18.04
             command:
               - /usr/local/bin/entrypoint.sh
             args:
               - --port=8500
               - --rest_api_port=9000
               - --model_name=resnet50_neuron
               - --model_base_path=s3://${your-bucket-of-models}/resnet50_neuron/
             ports:
               - containerPort: 8500
               - containerPort: 9000
             imagePullPolicy: IfNotPresent
             env:
               - name: AWS_REGION
                 value: "us-east-1"
               - name: S3_USE_HTTPS
                 value: "1"
               - name: S3_VERIFY_SSL
                 value: "0"
               - name: S3_ENDPOINT
                 value: s3.us-east-1.amazonaws.com
               - name: AWS_LOG_LEVEL
                 value: "3"
             resources:
               limits:
                 cpu: 4
                 memory: 4Gi
                 aws.amazon.com/neuron: 1
               requests:
                 cpu: "1"
                 memory: 1Gi
             securityContext:
               capabilities:
                 add:
                   - IPC_LOCK

2. Deploy the model.

::

   kubectl apply -f rn50_deployment.yaml

3. Create a file named `rn50_service.yaml` with the following contents. The gRPC and HTTP ports are opened for accepting prediction requests.

::

   kind: Service
   apiVersion: v1
   metadata:
     name: k8s-neuron-test
     labels:
       app: k8s-neuron-test
   spec:
     type: ClusterIP
     ports:
       - name: grpc-tf-serving
         port: 8500
         targetPort: 8500
       - name: http-tf-serving
         port: 9000
         targetPort: 9000
     selector:
       app: k8s-neuron-test
       role: master

4. Create a Kubernetes service for your TensorFlow model Serving application.

::

   kubectl apply -f rn50_service.yaml

Make predictions against your TensorFlow Serving service
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. To test locally, forward the gRPC port to the `k8s-neuron-test` service.

::

   kubectl port-forward service/k8s-neuron-test 8500:8500 &

2. Create a Python script called `tensorflow-model-server-infer.py` with the following content. This script runs inference via gRPC.

::

   import numpy as np
   import grpc
   import tensorflow as tf
   from tensorflow.keras.preprocessing import image
   from tensorflow.keras.applications.resnet50 import preprocess_input
   from tensorflow_serving.apis import predict_pb2
   from tensorflow_serving.apis import prediction_service_pb2_grpc
   from tensorflow.keras.applications.resnet50 import decode_predictions

   if __name__ == '__main__':
       channel = grpc.insecure_channel('localhost:8500')
       stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
       img_file = tf.keras.utils.get_file(
           "./kitten_small.jpg",
           "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
       img = image.load_img(img_file, target_size=(224, 224))
       img_array = preprocess_input(image.img_to_array(img)[None, ...])
       request = predict_pb2.PredictRequest()
       # The model name must match --model_name in the deployment above.
       request.model_spec.name = 'resnet50_neuron'
       request.inputs['input'].CopyFrom(
           tf.make_tensor_proto(img_array, shape=img_array.shape))
       result = stub.Predict(request)
       prediction = tf.make_ndarray(result.outputs['output'])
       print(decode_predictions(prediction))

3. Run the script to submit predictions to your service.
::

   python3 tensorflow-model-server-infer.py

Your output should look like the following:

::

   [[(u'n02123045', u'tabby', 0.68817204), (u'n02127052', u'lynx', 0.12701613), (u'n02123159', u'tiger_cat', 0.08736559), (u'n02124075', u'Egyptian_cat', 0.063844085), (u'n02128757', u'snow_leopard', 0.009240591)]]

================================================
FILE: containers/tutorials/inference/tutorial-infer.rst
================================================

.. _tutorial-infer:

Run Inference in a PyTorch Neuron Container
===========================================

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

This tutorial demonstrates how to run a PyTorch DLC on an Inferentia instance. By the end of this tutorial, you will be able to run inference using the container.

You will use an inf1.2xlarge to test your Docker configuration for Inferentia. To find out the available Neuron devices on your instance, use the command ``ls /dev/neuron*``.

Setup Environment
-----------------

1. Launch an Inf1 Instance
2. Set up the docker environment according to :ref:`tutorial-docker-env-setup`
3. Clone the `aws-neuron/deep-learning-containers `_ GitHub repository and use one of the PyTorch inference Dockerfiles found in the folders of the repo:

.. code:: bash

   git clone https://github.com/aws-neuron/deep-learning-containers.git
   cd deep-learning-containers/docker/pytorch/inference/2.9.0

For additional prerequisites and setup requirements, see the `docker build prerequisites `_. This tutorial requires the `torchserve entrypoint `_ and `torchserve config.properties `_, which are copied over to the same parent folder as part of the prerequisites.

With the files in a local directory, build the image with the following command:

.. code:: bash

   docker build . -f Dockerfile.neuronx -t neuron-container:pytorch

Run the following command to start the container:

.. code:: bash

   docker run -itd --name pt-cont -p 80:8080 -p 8081:8081 --device=/dev/neuron0 neuron-container:pytorch /usr/local/bin/entrypoint.sh -m 'pytorch-resnet-neuron=https://aws-dlc-sample-models.s3.amazonaws.com/pytorch/Resnet50-neuron.mar' -t /home/model-server/config.properties
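Once the container is up, you can send a quick smoke test from the host. This is a minimal sketch assuming the standard TorchServe REST endpoints and the ``pytorch-resnet-neuron`` model name registered by the command above:

.. code:: bash

   # Health check against the mapped inference port (8080 in the container, 80 on the host).
   curl http://localhost:80/ping

   # Download a test image and request a prediction from the registered model.
   curl -O https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg
   curl -X POST http://localhost:80/predictions/pytorch-resnet-neuron -T kitten_small.jpg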
================================================
FILE: containers/tutorials/k8s-default-scheduler.rst
================================================

This approach integrates the Neuron Scheduler Extension directly with the Kubernetes default scheduler. This method requires access to modify the default scheduler configuration.

**Prerequisites**

Ensure that the Neuron Device Plugin is running.

**Step 1: Configure kube-scheduler**

Enable the kube-scheduler to use a ConfigMap for scheduler policy. In your ``cluster.yml``, update the spec section with the following:

.. code:: yaml

   spec:
     kubeScheduler:
       usePolicyConfigMap: true

**Step 2: Launch the Cluster**

Create and launch the cluster:

.. code:: bash

   kops create -f cluster.yml
   kops create secret --name neuron-test-1.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
   kops update cluster --name neuron-test-1.k8s.local --yes

**Step 3: Install Neuron Scheduler Extension**

Install the Neuron Scheduler Extension and register it with kube-scheduler:

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --set "scheduler.enabled=true" \
       --set "scheduler.customScheduler.enabled=false" \
       --set "scheduler.defaultScheduler.enabled=true" \
       --set "npd.enabled=false"

================================================
FILE: containers/tutorials/k8s-multiple-scheduler.rst
================================================

This approach deploys a separate scheduler alongside the default Kubernetes scheduler. This is useful in environments where you don't have access to modify the default scheduler configuration, such as Amazon EKS. In this setup, a new scheduler (``my-scheduler``) is deployed with the Neuron Scheduler Extension integrated. Pods that need to run Neuron workloads specify this custom scheduler in their configuration.

.. note::

   Amazon EKS does not natively support modifying the default scheduler, so this multiple scheduler approach is required for EKS environments.

**Prerequisites**

Ensure that the Neuron Device Plugin is running.

**Step 1: Install Neuron Scheduler Extension**

Install the Neuron Scheduler Extension as a custom scheduler:

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --set "scheduler.enabled=true" \
       --set "npd.enabled=false"

**Step 2: Verify Installation**

Check that there are no errors in the ``my-scheduler`` pod logs and that the ``k8s-neuron-scheduler`` pod is bound to a node:

.. code:: bash

   kubectl logs -n kube-system my-scheduler-79bd4cb788-hq2sq

**Expected output:**

.. code:: bash

   I1012 15:30:21.629611 1 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/k8s-neuron-scheduler-5d9d9d7988-xcpqm" node="ip-192-168-2-25.ec2.internal" evaluatedNodes=1 feasibleNodes=1

**Step 3: Configure Pods to Use Custom Scheduler**

When creating Pods that need to use the Neuron Scheduler Extension, specify ``my-scheduler`` as the scheduler name. Here's a sample Pod specification:

.. code:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: 
   spec:
     restartPolicy: Never
     schedulerName: my-scheduler
     containers:
       - name: 
         command: [""]
         image: 
         resources:
           limits:
             cpu: "4"
             memory: 4Gi
             aws.amazon.com/neuroncore: 9
           requests:
             cpu: "1"
             memory: 1Gi

**Step 4: Verify Scheduling**

After running a Neuron workload Pod, verify that the Neuron Scheduler successfully processed the filter and bind requests:

.. code:: bash

   kubectl logs -n kube-system k8s-neuron-scheduler-5d9d9d7988-xcpqm

**Expected output for filter request:**

.. code:: bash

   2022/10/12 15:41:16 POD nrt-test-5038 fits in Node:ip-192-168-2-25.ec2.internal
   2022/10/12 15:41:16 Filtered nodes: [ip-192-168-2-25.ec2.internal]
   2022/10/12 15:41:16 Failed nodes: map[]
   2022/10/12 15:41:16 Finished Processing Filter Request...

**Expected output for bind request:**

.. code:: bash

   2022/10/12 15:41:16 Executing Bind Request!
   2022/10/12 15:41:16 Determine if the pod %v is NeuronDevice podnrt-test-5038
   2022/10/12 15:41:16 Updating POD Annotation with alloc devices!
   2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
   2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
   2022/10/12 15:41:16 Allocated ids for POD nrt-test-5038 are: 0,1,2,3,4,5,6,7,8
   2022/10/12 15:41:16 Try to bind pod nrt-test-5038 in default namespace to node ip-192-168-2-25.ec2.internal with &Binding{ObjectMeta:{nrt-test-5038 8da590b1-30bc-4335-b7e7-fe574f4f5538 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] [] []},Target:ObjectReference{Kind:Node,Namespace:,Name:ip-192-168-2-25.ec2.internal,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
   2022/10/12 15:41:16 Updating the DevUsageMap since the bind is successful!
   2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
   2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
   2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neurondevice in node: ip-192-168-2-25.ec2.internal is [false false false false]
   2022/10/12 15:41:16 Allocated devices list 0,1,2,3,4,5,6,7,8 for resource aws.amazon.com/neuroncore
   2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Allocated devices list [2] for other resource aws.amazon.com/neurondevice
   2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
   2022/10/12 15:41:16 Succesfully updated the DevUsageMap [true true true true true true true true true false false false false false false false] and otherDevUsageMap [true true true false] after alloc for node ip-192-168-2-25.ec2.internal
   2022/10/12 15:41:16 Finished executing Bind Request...

================================================
FILE: containers/tutorials/k8s-neuron-device-plugin.rst
================================================

The Neuron Device Plugin is a Kubernetes device plugin that exposes Neuron hardware resources to the cluster's scheduler. It discovers available Neuron devices on each node, advertises them as allocatable resources, and manages their lifecycle. When Pods request Neuron resources, the device plugin handles the allocation and ensures exclusive access to the assigned devices. This integration enables Kubernetes to treat Neuron accelerators as first-class schedulable resources, similar to GPUs or other specialized hardware.
The device plugin registers two resource types with Kubernetes:

* ``aws.amazon.com/neuroncore`` - Used for allocating individual Neuron cores to containers
* ``aws.amazon.com/neuron`` - Used for allocating entire Neuron devices to containers (all cores belonging to the device)

**Deploy Neuron Device Plugin**

**Prerequisites**

Ensure that all :ref:`prerequisites` are satisfied before proceeding.

**Installation**

Apply the Neuron Device Plugin as a DaemonSet on the cluster:

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --set "npd.enabled=false"

**Verify Installation**

Verify that the Neuron Device Plugin is running:

.. code:: bash

   kubectl get ds neuron-device-plugin -n kube-system

Expected output (example with 2 nodes in cluster):

.. code:: bash

   NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   neuron-device-plugin   2         2         2       2            2                           18h

**Verify Allocatable Resources**

Verify that nodes have allocatable Neuron cores:

.. code:: bash

   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"

Expected output:

.. code:: bash

   NAME                                          NeuronCore
   ip-192-168-65-41.us-west-2.compute.internal   32
   ip-192-168-87-81.us-west-2.compute.internal   32

Verify that nodes have allocatable Neuron devices:

.. code:: bash

   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"

Expected output:

.. code:: bash

   NAME                                          NeuronDevice
   ip-192-168-65-41.us-west-2.compute.internal   16
   ip-192-168-87-81.us-west-2.compute.internal   16
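With the plugin running, a workload consumes these resources through standard resource limits. A minimal sketch (the pod name and image are placeholders, not part of the official examples):

.. code:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: neuroncore-demo                # hypothetical name
   spec:
     restartPolicy: Never
     containers:
       - name: app
         image: <your-neuron-image>       # placeholder: an image with the Neuron SDK installed
         command: ["neuron-ls"]           # prints the devices allocated to this container
         resources:
           limits:
             aws.amazon.com/neuroncore: 1 # request one NeuronCore from the device plugin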
================================================
FILE: containers/tutorials/k8s-neuron-helm-chart.rst
================================================

.. _k8s-neuron-helm-chart:

The Neuron Helm Chart simplifies the deployment and management of Neuron infrastructure components on Kubernetes clusters. It provides a unified installation method for all essential Neuron components, streamlining the setup process and ensuring consistent configuration across your cluster.

Components Included
^^^^^^^^^^^^^^^^^^^

The Neuron Helm Chart includes the following components:

* Neuron Device Plugin
* Neuron Scheduler Extension
* :ref:`Neuron Node Problem Detector and Recovery `
* Neuron DRA (Dynamic Resource Allocation) Driver. Refer to :ref:`neuron-dra`.

Installation
^^^^^^^^^^^^

To install the Neuron Helm Chart:

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart

For detailed information on configuration options, advanced deployment scenarios, and troubleshooting, please refer to the official Neuron Helm Charts repository: https://github.com/aws-neuron/neuron-helm-charts/

================================================
FILE: containers/tutorials/k8s-neuron-monitor.rst
================================================

.. _k8s-neuron-monitor:

Neuron Monitor is a monitoring solution that collects and exposes metrics from Neuron devices and the Neuron runtime. It provides visibility into hardware utilization, performance counters, memory usage, and device health status. The monitor can export metrics in formats compatible with popular observability platforms like Prometheus, enabling integration with existing monitoring and alerting infrastructure. This allows operators to track Neuron device performance, identify bottlenecks, and troubleshoot issues in production environments.

For detailed information about Neuron Monitor, see the `Neuron Monitor User Guide `_.

.. note::

   Neuron Monitor does not currently support environments using the Neuron DRA (Dynamic Resource Allocation) Driver.

Deploy Neuron Monitor DaemonSet
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Step 1: Download the Configuration**

Download the Neuron Monitor YAML file: :download:`k8s-neuron-monitor-daemonset.yml `

**Step 2: Apply the Configuration**

Apply the Neuron Monitor YAML to create a DaemonSet on the cluster:

.. code:: bash

   kubectl apply -f k8s-neuron-monitor-daemonset.yml

**Step 3: Verify Installation**

Verify that the Neuron Monitor DaemonSet is running:

.. code:: bash

   kubectl get ds neuron-monitor --namespace neuron-monitor

Expected output (example with 2 nodes in cluster):

.. code:: bash

   NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   neuron-monitor   2         2         2       2            2                           27h

**Step 4: Get Pod Names**

Retrieve the Neuron Monitor pod names:

.. code:: bash

   kubectl get pods --namespace neuron-monitor

Expected output:

.. code:: bash

   NAME                   READY   STATUS    RESTARTS   AGE
   neuron-monitor-slsxf   1/1     Running   0          17m
   neuron-monitor-wc4f5   1/1     Running   0          17m

**Step 5: Verify Prometheus Endpoint**

Verify that the Prometheus metrics endpoint is available:

.. code:: bash

   kubectl exec neuron-monitor-wc4f5 --namespace neuron-monitor -- wget -q --output-document - http://127.0.0.1:8000

Expected output (sample metrics):

.. code:: bash

   # HELP python_gc_objects_collected_total Objects collected during gc
   # TYPE python_gc_objects_collected_total counter
   python_gc_objects_collected_total{generation="0"} 362.0
   python_gc_objects_collected_total{generation="1"} 0.0
   python_gc_objects_collected_total{generation="2"} 0.0
   # HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
   # TYPE python_gc_objects_uncollectable_total counter
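To scrape these metrics with Prometheus, point a scrape job at port 8000 of each monitor pod. A minimal static sketch (the job name and target are assumptions; in a real cluster you would typically use Kubernetes service discovery instead of a hard-coded target):

.. code:: yaml

   scrape_configs:
     - job_name: 'neuron-monitor'               # hypothetical job name
       static_configs:
         - targets: ['<monitor-pod-ip>:8000']   # port 8000 serves the metrics shown above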
================================================
FILE: containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst
================================================

.. _k8s-neuron-problem-detector-and-recovery-irsa:

Permissions for Neuron Node Problem Detector and Recovery
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Neuron Node Problem Detector and Recovery requires IAM roles for service accounts (IRSA) for authorization. For more information, see `IAM roles for service accounts `__ in the Amazon EKS User Guide. This section shows how to configure an IAM role for service accounts using the ``eksctl`` command-line tool.

**Step 1: Install eksctl**

Install the ``eksctl`` CLI using the instructions at https://eksctl.io/installation/.

**Step 2: Create IAM Policy**

Create an IAM policy that grants the necessary permissions for the Neuron Node Problem Detector.

.. code:: json

   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Action": [
                   "autoscaling:SetInstanceHealth",
                   "autoscaling:DescribeAutoScalingInstances"
               ],
               "Effect": "Allow",
               "Resource": ""
           },
           {
               "Action": [
                   "ec2:DescribeInstances"
               ],
               "Effect": "Allow",
               "Resource": "*",
               "Condition": {
                   "ForAllValues:StringEquals": {
                       "ec2:ResourceTag/aws:autoscaling:groupName": ""
                   }
               }
           },
           {
               "Action": [
                   "cloudwatch:PutMetricData"
               ],
               "Effect": "Allow",
               "Resource": "*",
               "Condition": {
                   "StringEquals": {
                       "cloudwatch:Namespace": "NeuronHealthCheck"
                   }
               }
           }
       ]
   }

Save the policy template above to a file named ``npd-policy.json`` (replacing the placeholder values), then run:

.. code:: bash

   aws iam create-policy \
       --policy-name NeuronProblemDetectorPolicy \
       --policy-document file://npd-policy.json

**Step 3: Create Namespace and Service Account**

Create a dedicated namespace for the Neuron Node Problem Detector:

.. code:: bash

   kubectl create ns neuron-healthcheck-system

**Step 4: Associate IAM Role with Service Account**

Use the following script to create the service account and associate it with the IAM role:

.. code:: bash

   #!/bin/bash
   CLUSTER_NAME=
   REGION_CODE=$(aws configure get region)
   POLICY_ARN=

   eksctl create iamserviceaccount \
       --name node-problem-detector \
       --namespace neuron-healthcheck-system \
       --cluster $CLUSTER_NAME \
       --attach-policy-arn $POLICY_ARN \
       --approve \
       --role-name neuron-problem-detector-role-$CLUSTER_NAME \
       --region $REGION_CODE \
       --override-existing-serviceaccounts

**Step 5: Verify Service Account Configuration**

Verify that the service account is annotated correctly with the IAM role:

.. code:: bash

   kubectl describe sa node-problem-detector -n neuron-healthcheck-system

Expected output:

.. code:: bash

   Name:                node-problem-detector
   Namespace:           neuron-healthcheck-system
   Labels:              app.kubernetes.io/managed-by=eksctl
   Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/neuron-problem-detector-role-cluster1
   Image pull secrets:  
   Mountable secrets:   
   Tokens:              
   Events:              

**Cleanup**

To remove the service account and associated IAM role, use the following command:

.. code:: bash

   #!/bin/bash
   CLUSTER_NAME=
   REGION_CODE=$(aws configure get region)

   eksctl delete iamserviceaccount \
       --name node-problem-detector \
       --namespace neuron-healthcheck-system \
       --cluster $CLUSTER_NAME \
       --approve \
       --region $REGION_CODE

================================================
FILE: containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst
================================================

.. _k8s-neuron-problem-detector-and-recovery:

Deploy Neuron Node Problem Detector and Recovery
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Neuron Node Problem Detector and Recovery is a critical resiliency component that continuously monitors the health of Neuron devices on each Kubernetes node by detecting hardware and software errors such as device failures, driver problems, and runtime errors. It integrates with the Kubernetes Node Problem Detector framework to report Neuron-specific conditions. When unrecoverable issues are detected, it can automatically remediate problems by marking nodes as unhealthy and triggering node replacement to prevent workload scheduling on faulty hardware. The component can also publish CloudWatch metrics under the ``NeuronHealthCheck`` namespace for monitoring and alerting purposes.

**Requirements**

Before deploying the Neuron Node Problem Detector and Recovery, ensure the following requirements are met:

* **Neuron Driver:** Version 2.15 or later
* **Neuron Runtime:** SDK 2.18 or later
* **Prerequisites:** All prerequisites for Kubernetes containers and the Neuron Node Problem Detector must be satisfied

**Installation**

Install the Neuron Node Problem Detector and Recovery as a DaemonSet using Helm:

.. note::

   The installation pulls the container image from the upstream Node Problem Detector repository at ``registry.k8s.io/node-problem-detector``.

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart

**Enable Node Recovery**

By default, the Neuron Node Problem Detector runs in **monitor-only mode**. To enable automatic node recovery functionality:

.. code:: bash

   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --set "npd.nodeRecovery.enabled=true"
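Once health metrics start flowing, you can confirm they are reaching CloudWatch from the CLI. A minimal sketch (assumes the IRSA permissions above and that at least one health event has been published):

.. code:: bash

   # List metrics published by the detector under the NeuronHealthCheck namespace.
   aws cloudwatch list-metrics --namespace NeuronHealthCheck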
**Verify Installation**

Verify that the Node Problem Detector pods are running:

.. code:: bash

   kubectl get pod -n neuron-healthcheck-system

Expected output (example with 4 nodes in cluster):

.. code:: bash

   NAME                          READY   STATUS    RESTARTS   AGE
   node-problem-detector-7qcrj   1/1     Running   0          59s
   node-problem-detector-j45t5   1/1     Running   0          59s
   node-problem-detector-mr2cl   1/1     Running   0          59s
   node-problem-detector-vpjtk   1/1     Running   0          59s

**Monitoring and Metrics**

When an unrecoverable error occurs, the Neuron Node Problem Detector:

* Publishes metrics to CloudWatch under the ``NeuronHealthCheck`` namespace
* Updates the node's ``NodeCondition``, which can be viewed using:

.. code:: bash

   kubectl describe node

================================================
FILE: containers/tutorials/k8s-neuron-scheduler-flow.rst
================================================

.. _k8s-neuron-scheduler-flow:

Neuron Scheduler Extension Flow Diagram
---------------------------------------

::

   [ASCII flow diagram: a POD manifest requesting aws.amazon.com/neuroncore: 2
    flows through kube-scheduler and the neuron-scheduler-ext, which annotates
    the POD (e.g. NEURON_CORES: 2,3); the kubelet and neuron-device-plugin on
    the Inf1/Trn1 node then pass --device=/dev/neuron1 and
    NEURON_RT_VISIBLE_CORES=2,3 to the container runtime. The numbered steps
    below describe the flow.]

1. neuron-device-plugin returns the list of Neuron cores/devices to the kubelet
2. The kubelet advertises the core/device list to the K8s API server (and in turn to kube-scheduler)
3. A POD requests Neuron cores/devices [kube-scheduler picks up the POD creation request]
4. kube-scheduler calls the neuron-scheduler-extn filter function with the list of nodes and the POD specification
5. neuron-scheduler-extn scans through the nodes, filters out nodes with non-contiguous cores/devices, and returns the nodes that are capable of supporting the given POD specification
6. kube-scheduler calls the neuron-scheduler-extn bind function with the pod and node
7. neuron-scheduler-extn updates the POD annotation with the allocated Neuron core/device IDs (contiguous)
8. neuron-scheduler-extn sends the bind request to the kubelet of the selected node
9. The kubelet calls the Alloc function of the neuron-device-plugin
10. neuron-device-plugin queries the POD annotation for the allocated core/device IDs
11. neuron-device-plugin exports the devices and visible cores to the container runtime

================================================
FILE: containers/tutorials/k8s-neuron-scheduler.rst
================================================

The Neuron Scheduler Extension is a Kubernetes scheduler plugin that provides intelligent, topology-aware scheduling for Neuron workloads. While the device plugin handles basic resource allocation, the scheduler extension optimizes Pod placement by considering Neuron core topology, NeuronCore-to-NeuronCore connectivity, and workload requirements. It ensures efficient utilization of Neuron devices by placing Pods on nodes where the requested Neuron cores are optimally configured.
This component is optional and primarily beneficial for workloads that require specific subsets of Neuron devices or cores rather than consuming all available resources on a node. The scheduler extension is required for scheduling Pods that request more than one Neuron core or device resource. It finds sets of directly connected devices with minimal communication latency when scheduling containers, ensuring optimal performance for multi-device workloads.

For a graphical depiction of how the Neuron Scheduler Extension works, see :ref:`k8s-neuron-scheduler-flow`.

**Device Allocation by Instance Type**

The Neuron Scheduler Extension applies topology-aware scheduling rules based on instance type to ensure consistent and high performance regardless of which cores and devices are assigned to containers.

**Inf1 and Inf2 Instances (Ring Topology)**

Devices are connected through a ring topology, with no restrictions on the number of devices requested (as long as it is fewer than the total devices on a node). When N devices are requested, the scheduler finds a node where N contiguous devices are available to minimize communication latency. It will never allocate non-contiguous devices to the same container. For example, when a container requests 3 Neuron devices, the scheduler might assign devices 0, 1, 2 if available, but never devices 0, 2, 4, because those devices are not directly connected.

The figure below shows examples of device sets on an inf2.48xlarge node that could be assigned to a container requesting 2 devices:

|eks-inf2-device-set|

**Trn1.32xlarge and Trn1n.32xlarge Instances (2D Torus Topology)**

Devices are connected via a 2D torus topology. The scheduler enforces that containers request 1, 4, 8, or all 16 devices. If your container requires a different number of devices (such as 2 or 5), we recommend using an Inf2 instance instead to benefit from more flexible topology support.

If you request an invalid number of devices (such as 7), your Pod will not be scheduled and you will receive a warning: ``Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.``

When requesting 4 devices, your container will be allocated one of the following device sets, if available:

|eks-trn1-device-set4|

When requesting 8 devices, your container will be allocated one of the following device sets, if available:

|eks-trn1-device-set8|

.. note::

   For all instance types, requesting one or all Neuron cores or devices is always valid.

**Deploy Neuron Scheduler Extension**

.. tab-set::

   .. tab-item:: Multiple Scheduler Approach

      .. include:: /containers/tutorials/k8s-multiple-scheduler.rst

   .. tab-item:: Default Scheduler Approach

      .. include:: /containers/tutorials/k8s-default-scheduler.rst

.. |eks-inf2-device-set| image:: /images/eks-inf2-device-set.png
.. |eks-trn1-device-set4| image:: /images/eks-trn1-device-set4.png
.. |eks-trn1-device-set8| image:: /images/eks-trn1-device-set8.png
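Tying the Trn1 allocation rules above to a manifest, the sketch below requests a device count that is valid on trn1.32xlarge. The pod name and image are placeholders, and ``schedulerName`` applies only when using the multiple-scheduler approach:

.. code:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: trn1-topology-demo            # hypothetical name
   spec:
     restartPolicy: Never
     schedulerName: my-scheduler         # omit when using the default-scheduler approach
     containers:
       - name: app
         image: <your-neuron-image>      # placeholder
         resources:
           limits:
             aws.amazon.com/neuron: 4    # valid on trn1.32xlarge: 1, 4, 8, or 16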
================================================
FILE: containers/tutorials/k8s-prerequisite.rst
================================================

.. _k8s-prerequisite:

.. meta::
   :description: Learn how to create an Amazon EKS cluster with AWS Trainium instances (Trn1, Trn2) for machine learning workloads using AWS Neuron SDK. Step-by-step guide with eksctl and CloudFormation templates.
   :keywords: EKS, Kubernetes, Trainium, Trn1, Trn2, Neuron, AWS, machine learning, distributed training, eksctl, CloudFormation, EFA, node group

Before setting up Neuron components on your EKS cluster, you must create an EKS cluster and add Neuron-enabled nodes. This section guides you through creating an Amazon Elastic Kubernetes Service (EKS) cluster with AWS Trainium-enabled nodes (Trn1 or Trn2 instances) using CloudFormation templates and the eksctl command-line tool. You'll configure optimized networking with Elastic Fabric Adapter (EFA) support and pre-configured Neuron components for distributed training and inference workloads.

For detailed information, refer to:

* `EKS Cluster Creation Guide `_
* `EKS Compute Resources Guide `_
* `eksctl Getting Started `_

**Step 1: Download Node Group Template**

Download the node group CloudFormation template for your instance type.

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: bash

         wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml

   .. tab-item:: Trn2

      .. code-block:: bash

         wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn2_ng_stack_al2023.yaml

**Important template configuration information**

* **Placement Group:** Optimizes network speed between nodes
* **EFA Driver:** Installed automatically (ensure the ``libfabric`` version matches between the AMI and workload containers)
* **AMI:** Uses the `EKS optimized accelerated AMI `_ with Neuron components pre-installed
* **Instance Type:** Configured for trn1.32xlarge or trn2.48xlarge (update to your desired instance type)
* **Kubernetes Version:** Trn1 templates use Kubernetes 1.25+, Trn2 templates use Kubernetes 1.34+ (update as needed)

Trn2 LNC configuration (Optional): Trn2 instances use a default Logical NeuronCore Configuration (LNC) of ``2``. To change it to ``1``, update the ``UserData`` section of the launch template:

.. code-block:: bash

   --==BOUNDARY==
   Content-Type: text/x-shellscript; charset="us-ascii"

   #!/bin/bash
   set -ex
   config_dir=/opt/aws/neuron
   config_file=${config_dir}/logical_nc_config
   [ -d "$config_dir" ] || mkdir -p "$config_dir"
   [ -f "$config_file" ] || touch "$config_file"
   if ! grep -q "^NEURON_LOGICAL_NC_CONFIG=1$" "$config_file" 2>/dev/null; then
       printf "NEURON_LOGICAL_NC_CONFIG=1" >> "$config_file"
   fi
   --==BOUNDARY==--
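As a quick sanity check once a node has booted with this UserData, you can confirm the override landed in the file the script writes (path taken from the script above):

.. code-block:: bash

   # On the node, after boot:
   cat /opt/aws/neuron/logical_nc_config
   # Expected: NEURON_LOGICAL_NC_CONFIG=1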
**Step 2: Create Cluster Parameter Script**

Create a bash script to capture the parameters needed for the node template:

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: bash

         #!/bin/bash
         CLUSTER_NAME=$1
         CLUSTER_SG=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
         VPC_ID=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")

         cat <<EOF > cfn_params.json
         [
           { "ParameterKey": "ClusterName", "ParameterValue": "$CLUSTER_NAME" },
           { "ParameterKey": "ClusterControlPlaneSecurityGroup", "ParameterValue": "$CLUSTER_SG" },
           { "ParameterKey": "VpcId", "ParameterValue": "$VPC_ID" }
         ]
         EOF

   .. tab-item:: Trn2

      .. code-block:: bash

         #!/bin/bash
         CLUSTER_NAME=$1
         CLUSTER_SG=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
         VPC_ID=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")
         CLUSTER_ENDPOINT=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].Endpoint")
         CLUSTER_SERVICE_CIDR=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].KubernetesNetworkConfig.ServiceIpv4Cidr")
         CLUSTER_CA=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].CertificateAuthority.Data")

         cat <<EOF > cfn_params.json
         [
           { "ParameterKey": "ClusterName", "ParameterValue": "$CLUSTER_NAME" },
           { "ParameterKey": "ClusterControlPlaneSecurityGroup", "ParameterValue": "$CLUSTER_SG" },
           { "ParameterKey": "VpcId", "ParameterValue": "$VPC_ID" },
           { "ParameterKey": "ClusterEndpoint", "ParameterValue": "$CLUSTER_ENDPOINT" },
           { "ParameterKey": "ClusterServiceCidr", "ParameterValue": "$CLUSTER_SERVICE_CIDR" },
           { "ParameterKey": "ClusterCertificateAuthority", "ParameterValue": "$CLUSTER_CA" }
         ]
         EOF

This script captures the cluster name, the security group for control plane connectivity, and the VPC ID.

**Step 3: Create CloudFormation Stack**

Create the CloudFormation stack for the node group.

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: bash

         aws cloudformation create-stack \
             --stack-name eks-trn1-ng-stack \
             --template-body file://eks_trn1_ng_stack.yaml \
             --parameters file://cfn_params.json \
             --capabilities CAPABILITY_IAM

   .. tab-item:: Trn2

      .. code-block:: bash

         aws cloudformation create-stack \
             --stack-name eks-trn2-ng-stack \
             --template-body file://eks_trn2_ng_stack_al2023.yaml \
             --parameters file://cfn_params.json \
             --capabilities CAPABILITY_IAM

Wait for the stack creation to complete before proceeding. You can monitor the progress in the AWS CloudFormation console.

**Step 4: Determine Availability Zones**

Identify the availability zones for your cluster:

.. code-block:: bash

   aws ec2 describe-availability-zones \
       --region $REGION_CODE \
       --filters "Name=zone-id,Values=$1" \
       --query "AvailabilityZones[].ZoneName" \
       --output text

**Step 5: Generate Node Group Configuration**

Create a script named ``create_ng_yaml.sh`` to generate the node group YAML configuration. The script requires: region, availability zones, cluster name, and CloudFormation stack name.

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: bash

         #!/bin/bash
         REGION_CODE=$1
         EKSAZ1=$2
         EKSAZ2=$3
         CLUSTER_NAME=$4
         STACKNAME=$5

         LT_ID_TRN1=$(aws cloudformation describe-stacks --stack-name $STACKNAME \
             --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn1'].OutputValue" \
             --output text)

         cat <<EOF > trn1_nodegroup.yaml
         apiVersion: eksctl.io/v1alpha5
         kind: ClusterConfig
         metadata:
           name: $CLUSTER_NAME
           region: $REGION_CODE
           version: "1.28"
         iam:
           withOIDC: true
         availabilityZones: ["$EKSAZ1","$EKSAZ2"]
         managedNodeGroups:
           - name: trn1-32xl-ng1
             launchTemplate:
               id: $LT_ID_TRN1
             minSize: 1
             desiredCapacity: 1
             maxSize: 1
             availabilityZones: ["$EKSAZ1"]
             privateNetworking: true
             efaEnabled: true
         EOF

   .. tab-item:: Trn2

      .. code-block:: bash

         #!/bin/bash
         REGION_CODE=$1
         EKSAZ1=$2
         EKSAZ2=$3
         CLUSTER_NAME=$4
         STACKNAME=$5

         LT_ID_TRN2=$(aws cloudformation describe-stacks --stack-name $STACKNAME \
             --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn2'].OutputValue" \
             --output text)

         cat <<EOF > trn2_nodegroup.yaml
         apiVersion: eksctl.io/v1alpha5
         kind: ClusterConfig
         metadata:
           name: $CLUSTER_NAME
           region: $REGION_CODE
           version: "1.34"
         iam:
           withOIDC: true
         availabilityZones: ["$EKSAZ1","$EKSAZ2"]
         managedNodeGroups:
           - name: trn2-48xl-ng1
             launchTemplate:
               id: $LT_ID_TRN2
             minSize: 1
             desiredCapacity: 1
             maxSize: 1
             availabilityZones: ["$EKSAZ1"]
             privateNetworking: true
             efaEnabled: true
         EOF
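A hypothetical invocation of the script, using the region, availability zones, cluster name, and stack name from the earlier steps (the values here are examples only):

.. code-block:: bash

   chmod +x create_ng_yaml.sh
   # args: region, AZ1, AZ2, cluster name, CloudFormation stack name
   ./create_ng_yaml.sh us-west-2 us-west-2d us-west-2c nemo2 eks-trn1-ng-stack   # use eks-trn2-ng-stack for Trn2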
Run the script to generate the configuration file. Update the Kubernetes version as needed for your environment. Example output:

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: yaml

         apiVersion: eksctl.io/v1alpha5
         kind: ClusterConfig
         metadata:
           name: nemo2
           region: us-west-2
           version: "1.28"
         iam:
           withOIDC: true
         availabilityZones: ["us-west-2d","us-west-2c"]
         managedNodeGroups:
           - name: trn1-32xl-ng1
             launchTemplate:
               id: lt-093c222b35ea89009
             minSize: 1
             desiredCapacity: 1
             maxSize: 1
             availabilityZones: ["us-west-2d"]
             privateNetworking: true
             efaEnabled: true

   .. tab-item:: Trn2

      .. code-block:: yaml

         apiVersion: eksctl.io/v1alpha5
         kind: ClusterConfig
         metadata:
           name: nemo2
           region: us-west-2
           version: "1.34"
         iam:
           withOIDC: true
         availabilityZones: ["us-west-2d","us-west-2c"]
         managedNodeGroups:
           - name: trn2-48xl-ng1
             launchTemplate:
               id: lt-093c222b35ea89010
             minSize: 1
             desiredCapacity: 1
             maxSize: 1
             availabilityZones: ["us-west-2d"]
             privateNetworking: true
             efaEnabled: true

**Step 6: Create Node Group**

Create the node group using the generated configuration.

.. tab-set::

   .. tab-item:: Trn1

      .. code-block:: bash

         eksctl create nodegroup -f trn1_nodegroup.yaml

   .. tab-item:: Trn2

      .. code-block:: bash

         eksctl create nodegroup -f trn2_nodegroup.yaml

Wait for the nodes to reach the ``Ready`` state. Verify using:

.. code-block:: bash

   kubectl get nodes

**Step 7: Install EFA Device Plugin (Optional)**

If you plan to run distributed training or inference jobs, install the EFA device plugin following the instructions at the `EFA device plugin repository `_.
================================================
FILE: containers/tutorials/k8s-setup.rst
================================================

.. _tutorial-k8s-env-setup-for-neuron-to-remove:

Kubernetes environment setup for Neuron
=======================================

Introduction
------------

Customers that use Kubernetes can conveniently integrate Inf1/Trn1 instances into their workflows. This tutorial goes through deploying the Neuron device plugin daemonset and allocating Neuron cores or devices to application pods.

.. dropdown:: Prerequisite
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /containers/tutorials/k8s-prerequisite.rst

.. dropdown:: Deploy Neuron Device Plugin
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /containers/tutorials/k8s-neuron-device-plugin.rst

.. dropdown:: Deploy Neuron Scheduler Extension
   :class-title: sphinx-design-class-title-small
   :class-body: sphinx-design-class-body-small
   :animate: fade-in

   .. include:: /containers/tutorials/k8s-neuron-scheduler.rst

================================================
FILE: containers/tutorials/training/index.rst
================================================

Containers -- Training Tutorials
================================

.. toctree::
   :maxdepth: 1
   :hidden:

   /containers/tutorials/training/tutorial-training
   /containers/tutorials/training/k8s_mlp_train_demo

.. include:: /containers/tutorials/training/index.txt

================================================
FILE: containers/tutorials/training/index.txt
================================================

* :ref:`tutorial-training`
* :ref:`example-deploy-mlp-train-pod`

================================================
FILE: containers/tutorials/training/k8s_mlp_train_demo.rst
================================================

.. _example-deploy-mlp-train-pod:

Deploy a simple MLP training script as a Kubernetes job
-------------------------------------------------------

This tutorial uses MLP training as a teaching example of how to deploy a training application using Kubernetes on Trn1 instances. For a more advanced example, please refer to `Tutorial: Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS `__

Prerequisites:
^^^^^^^^^^^^^^

- :ref:`tutorial-k8s-env-setup-for-neuron`: to set up k8s support on your cluster.
- Trn1 instances as worker nodes with attached roles allowing:

  - ECR read access policy to retrieve container images from ECR: **arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly**

- A container image that is built using :ref:`tutorial-training`

Deploy an MLP training image
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Create a file named `mlp_train.yaml` with the contents below.

.. note::

   In the image: add the appropriate location of the image

::

   apiVersion: v1
   kind: Pod
   metadata:
     name: trn1-mlp
   spec:
     restartPolicy: Never
     schedulerName: default-scheduler
     hostNetwork: true
     nodeSelector:
       beta.kubernetes.io/instance-type: trn1.32xlarge
     containers:
       - name: trn1-mlp
         command: ["/usr/local/bin/python3"]
         args: ["/opt/ml/mlp_train.py"]
         image: 647554078242.dkr.ecr.us-east-1.amazonaws.com/sunda-pt:k8s_mlp_0907
         imagePullPolicy: IfNotPresent
         env:
           - name: NEURON_RT_LOG_LEVEL
             value: "INFO"
         resources:
           limits:
             aws.amazon.com/neuron: 2
           requests:
             aws.amazon.com/neuron: 2

2. Deploy the pod.

::

   kubectl apply -f mlp_train.yaml

3. Check the logs to make sure training completed.

::

   kubectl logs 

Your log should contain the following:

::

   Final loss is 0.1977
   ----------End Training ---------------

================================================
FILE: containers/tutorials/training/tutorial-training.rst
================================================

.. _tutorial-training:

Run Training in a PyTorch Neuron Container
==========================================

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

This tutorial demonstrates how to run a PyTorch container on a Trainium instance. By the end of this tutorial, you will be able to run simple MLP training using the container.

You will use a trn1.2xlarge to test your Docker configuration for Trainium. To find out the available Neuron devices on your instance, use the command ``ls /dev/neuron*``.

Setup Environment
-----------------

1. Launch a Trn1 Instance

   .. include:: /setup/install-templates/launch-instance.txt

2. Set up the docker environment according to :ref:`tutorial-docker-env-setup`
================================================ FILE: containers/tutorials/tutorial-docker-env-setup.rst ================================================ .. _tutorial-docker-env-setup: Tutorial Docker environment setup ================================= Introduction ------------ A Neuron application can be deployed using Docker containers. This tutorial describes how to configure Docker on Amazon Linux 2023 to expose Inferentia/Trainium devices to containers. .. tab-set:: .. tab-item:: Training .. dropdown:: Install Drivers :class-title: sphinx-design-class-title-small :class-body: sphinx-design-class-body-small :animate: fade-in .. code:: bash # Configure Linux for Neuron repository updates sudo tee /etc/yum.repos.d/neuron.repo > /dev/null < /dev/null < `oci-add-hooks `__ is an OCI runtime whose sole purpose is injecting OCI prestart, poststart, and poststop hooks into a container's config.json before passing it along to an OCI-compatible runtime. oci-add-hooks is used to inject a hook that exposes Inferentia devices to the container. .. code:: bash

   sudo apt install -y golang && \
   export GOPATH=$HOME/go && \
   go get github.com/joeshaw/json-lossless && \
   cd /tmp/ && \
   git clone https://github.com/awslabs/oci-add-hooks && \
   cd /tmp/oci-add-hooks && \
   make build && \
   sudo cp /tmp/oci-add-hooks/oci-add-hooks /usr/local/bin/

Install the package that provides the OCI hook software ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. important:: This step should run on the Linux host and not inside the container. For Inf1, install the following package .. code:: bash

   sudo apt-get install aws-neuron-runtime-base -y

For Trn1, install the following package .. code:: bash

   sudo apt-get install aws-neuronx-oci-hook -y

For the Docker runtime, set up Docker to use the oci-neuron OCI runtime. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ oci-neuron is a script representing an OCI-compatible runtime. It wraps oci-add-hooks, which wraps runc. In this step, we configure Docker to point at the oci-neuron OCI runtime: .. code:: bash

   sudo cp /opt/aws/neuron/share/docker-daemon.json /etc/docker/daemon.json
   sudo service docker restart

If the docker restart command fails, make sure the docker systemd service is not masked. More information on this can be found here: https://stackoverflow.com/a/37640824 For the containerd runtime, set up containerd to use the oci-neuron OCI runtime. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Update the following fields in ``/etc/containerd/config.toml`` to configure containerd to use the Neuron OCI hook: .. code:: bash

   default_runtime_name = "neuron"
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron]
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron.options]
   BinaryName = "/opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh"

Then restart the containerd daemon: .. code:: bash

   sudo systemctl restart containerd
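As an optional sanity check (a hedged sketch, assuming containerd v1.6 or later), you can confirm that the neuron runtime is present in the merged containerd configuration before scheduling workloads:

.. code-block:: bash

   # Print the merged containerd config and look for the neuron runtime entry.
   sudo containerd config dump | grep -A 2 'runtimes.neuron'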
For the cri-o runtime, set up cri-o to use the oci-neuron OCI runtime. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Update the following fields in ``/etc/crio/crio.conf`` to configure cri-o to use the Neuron OCI hook: .. code:: bash

   default_runtime_name = "neuron"
   [crio.runtime.runtimes.neuron]
   runtime_path = "/opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh"

Then restart the cri-o daemon: .. code:: bash

   sudo systemctl restart cri-o

.. _oci-hook-workarounds: OCI hook workarounds ^^^^^^^^^^^^^^^^^^^^ **ECS (EC2)** Add the following to your ECS task definition: .. code:: json

   "linuxParameters": {
     "devices": [
       {
         "containerPath": "/dev/neuron0",
         "hostPath": "/dev/neuron0",
         "permissions": ["read", "write"]
       },
       {
         "containerPath": "/dev/neuron1",
         "hostPath": "/dev/neuron1",
         "permissions": ["read", "write"]
       },
       ...
     ]
   },

The ``linuxParameters`` parameter can be found under ``containerDefinition``. More information can be found here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_linuxparameters. Expose as many Neuron devices as needed, up to the maximum number of devices for the specified instance. For example, the trn1.32xlarge instance type contains 16 Neuron devices, so the devices that can be exposed are /dev/neuron0, /dev/neuron1, up to /dev/neuron15. To see an example of an ECS task definition exposing Neuron devices, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-inference-task-def.html. ================================================ FILE: containers/tutorials.rst ================================================ .. meta:: :description: Comprehensive tutorials for deploying AWS Neuron SDK in containers with Docker and Kubernetes. Learn to build Neuron containers, configure EKS clusters, deploy device plugins, and set up monitoring for Trainium and Inferentia instances. :keywords: Neuron containers, Docker, Kubernetes, EKS, Trainium, Inferentia, device plugin, scheduler, monitoring, tutorials, AWS, machine learning Containers - Tutorials ======================= Learn how to deploy and manage AWS Neuron workloads in containerized environments. These tutorials cover everything from building Docker containers with Neuron support to deploying production-ready Kubernetes clusters with device plugins, schedulers, and monitoring solutions. Whether you're running inference or training workloads on AWS Trainium or Inferentia instances, these step-by-step guides will help you configure your container infrastructure for optimal performance and reliability. .. toctree:: :maxdepth: 1 :hidden: Inference Training /containers/tutorials/tutorial-docker-env-setup /containers/tutorials/build-run-neuron-container /containers/tutorials/tutorial-oci-hook /containers/tutorials/k8s-setup /containers/tutorials/k8s-neuron-helm-chart /containers/tutorials/k8s-neuron-scheduler-flow /containers/tutorials/k8s-neuron-monitor /containers/tutorials/k8s-neuron-problem-detector-and-recovery /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa General Container Tutorials ---------------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Docker Environment Setup :link: /containers/tutorials/tutorial-docker-env-setup :link-type: doc Configure Docker on Amazon Linux 2023 to expose Inferentia and Trainium devices to containers. Install Neuron drivers, runtime, and configure the Docker daemon for Neuron device access. ..
grid-item-card:: Build and Run Neuron Containers :link: /containers/tutorials/build-run-neuron-container :link-type: doc Learn how to build Docker images with Neuron support using provided Dockerfiles and run containerized applications on Inf1 and Trn1 instances with proper device exposure. .. grid-item-card:: Docker Neuron OCI Hook Setup :link: /containers/tutorials/tutorial-oci-hook :link-type: doc Install and configure the Neuron OCI hook to enable the AWS_NEURON_VISIBLE_DEVICES environment variable for exposing all Neuron devices to containers without explicit device flags. Kubernetes Setup and Configuration ----------------------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Kubernetes Environment Setup :link: /containers/tutorials/k8s-setup :link-type: doc Complete guide to setting up Kubernetes for Neuron, including EKS cluster creation with Trainium nodes, device plugin installation, scheduler extension setup, and resource allocation configuration. .. grid-item-card:: Neuron Helm Chart :link: /containers/tutorials/k8s-neuron-helm-chart :link-type: doc Simplify Neuron infrastructure deployment with the unified Helm chart that installs device plugins, scheduler extensions, node problem detector, and DRA driver in a single command. Kubernetes Device Management ----------------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Scheduler Flow Diagram :link: /containers/tutorials/k8s-neuron-scheduler-flow :link-type: doc Visual diagram showing how the Neuron Scheduler Extension integrates with Kubernetes components to schedule Pods with Neuron resource requests. Kubernetes Monitoring and Recovery ----------------------------------- .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Neuron Monitor :link: /containers/tutorials/k8s-neuron-monitor :link-type: doc Deploy Neuron Monitor to collect and expose metrics from Neuron devices and runtime. Integrate with Prometheus for observability, performance tracking, and troubleshooting. .. grid-item-card:: Node Problem Detector and Recovery :link: /containers/tutorials/k8s-neuron-problem-detector-and-recovery :link-type: doc Monitor Neuron device health and automatically remediate issues by detecting hardware failures, driver problems, and runtime errors. Enable automatic node replacement for faulty hardware. .. grid-item-card:: NPD Permissions (IRSA) :link: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa :link-type: doc Configure IAM roles for service accounts (IRSA) to grant the Neuron Node Problem Detector necessary permissions for Auto Scaling group operations and CloudWatch metrics. Training and Inference Container Tutorials ------------------------------------------ .. tab-set:: .. tab-item:: Training .. include:: /containers/tutorials/training/index.txt .. tab-set:: .. tab-item:: Inference .. include:: /containers/tutorials/inference/index.txt ================================================ FILE: devflows/aws-batch-flows.rst ================================================ .. _aws_batch_flow: AWS Batch ========= .. toctree:: :maxdepth: 1 /devflows/training/batch/batch-training ================================================ FILE: devflows/aws-batch-flows.txt ================================================ .. tab-set:: .. tab-item:: Inference .. include:: /devflows/inference/aws-batch-flows.txt .. tab-set:: .. tab-item:: Training .. 
include:: /devflows/training/aws-batch-flows.txt ================================================ FILE: devflows/dlc-then-customize-devflow.rst ================================================ .. _dlc-then-customize-devflow: Customize Neuron DLC ============================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- This guide covers how to customize and extend the Neuron Deep Learning Container (DLC) to fit your specific project needs. You can customize the DLC either by using the DLC as a base image in your Dockerfile or by modifying published Dockerfiles on GitHub. Method 1: Using DLC as a Base Image ----------------------------------- 1. Create a new Dockerfile. In your Dockerfile, specify the Neuron DLC as your base image using the FROM directive. 2. Complete the Dockerfile. You can add additional packages, change the base environment, or make any other modifications that suit your project. `AWS Batch Training `_ is a good example that customizes the Neuron DLC by using it as the base image (see the sketch at the end of this section). Its `Dockerfile `_ shows that the customized container copies ``llama_batch_training.sh`` into the image and runs it. 3. Navigate to the directory containing your Dockerfile and build your custom container. Method 2: Modifying Published Dockerfiles ----------------------------------------- 1. Visit the `Neuron DLC Github repo `_ and locate the Dockerfile for the container you wish to customize. 2. Modify the Dockerfile as needed. You can add additional packages, change the base environment, or make any other modifications that suit your project. For example, if you do not need to use Neuron tools in your scenario and want to make the container smaller, you can remove aws-neuronx-tools at this `line `_. 3. Navigate to the directory containing your Dockerfile and build your custom container.
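To make Method 1 concrete, here is a hedged sketch of such a Dockerfile. The base image URI is a placeholder (substitute a published image tag from the Neuron DLC GitHub repo), and the training script name follows the AWS Batch example above:

.. code-block:: bash

   # Sketch only: <neuron-dlc-image-uri> is a placeholder for a published Neuron DLC image.
   cat << 'EOF' > Dockerfile
   FROM <neuron-dlc-image-uri>
   COPY llama_batch_training.sh /opt/ml/
   CMD ["/bin/bash", "/opt/ml/llama_batch_training.sh"]
   EOF
   docker build -t my-neuron-dlc:custom .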
================================================ FILE: devflows/ec2-flows.rst ================================================ .. _amazon-ec2: Amazon EC2 ========== .. toctree:: :maxdepth: 1 :hidden: Inference Training .. include:: /devflows/ec2-flows.txt ================================================ FILE: devflows/ec2-flows.txt ================================================ .. tab-set:: .. tab-item:: Inference .. include:: /devflows/inference/ec2-flows.txt .. tab-set:: .. tab-item:: Training .. include:: /devflows/training/ec2-flows.txt ================================================ FILE: devflows/ecs-flows.rst ================================================ .. _ecs_flow: Amazon ECS ========== .. toctree:: :maxdepth: 1 /devflows/plugins/npd-ecs-flows /devflows/inference/dlc-then-ecs-devflow /devflows/training/dlc-then-ecs-devflow In this section, you'll find resources to help you use Neuron with an ECS cluster, deploying inference and training workloads on Inferentia and Trainium ECS clusters. Using Neuron Node Problem Detector Plugin with ECS -------------------------------------------------- The Neuron node problem detector and recovery plugin enhances resiliency by detecting and remediating errors. To get started with the Neuron node problem detector and recovery plugin on an ECS cluster, please refer to :ref:`ecs-neuron-problem-detector-and-recovery`. Running Inference workload -------------------------- This guide walks you through the end-to-end process of building and running a Docker container with your model and deploying it on an ECS cluster with Inferentia instances. For running machine learning inference workloads on Amazon ECS using AWS Deep Learning Containers, please refer to :ref:`inference-dlc-then-ecs-devflow`. Running Training workload ------------------------- This guide walks you through the end-to-end process of building and running a Docker container with your model and deploying it on an ECS cluster with Trainium instances. For running machine learning training workloads on Amazon ECS using AWS Deep Learning Containers, please refer to :ref:`training-dlc-then-ecs-devflow`. ================================================ FILE: devflows/eks-flows.rst ================================================ .. _eks_flow: Amazon EKS ========== .. toctree:: :maxdepth: 1 /containers/kubernetes-getting-started /devflows/inference/dlc-then-eks-devflow /containers/tutorials/training/k8s_mlp_train_demo In this section, you'll find resources to help you use Neuron with an EKS cluster, deploying inference and training workloads on Inferentia and Trainium EKS clusters. EKS Setup ------------ This guide covers setting up the Neuron device plugin, scheduler extension, node problem detector, and monitoring plugins. These components enable efficient resource utilization, monitoring, and resilience when using Inferentia and Trainium instances for inference and training workloads on Kubernetes clusters. To get started with using AWS Neuron and setting up the required plugins on an EKS cluster, please refer to :ref:`tutorial-k8s-env-setup-for-neuron`. Running Inference workload -------------------------- This guide walks you through the end-to-end process of building and running a Docker container with your model and deploying it on an EKS cluster with Inferentia instances. For running machine learning inference workloads on Amazon EKS using AWS Deep Learning Containers, please refer to :ref:`dlc-then-eks-devflow`. Running Training workload ------------------------- This guide walks you through the end-to-end process of building and running a Docker container with your model and deploying it on an EKS cluster with Trainium instances. For running machine learning training workloads on Amazon EKS using AWS Deep Learning Containers, please refer to :ref:`example-deploy-mlp-train-pod`. ================================================ FILE: devflows/index.rst ================================================ .. _neuron-devflows: .. meta:: :description: :date-modified: AWS Workload Orchestration ========================== AWS Neuron integrates seamlessly with various AWS compute and orchestration services to accelerate deep learning workloads. This section provides deployment patterns and best practices for running Neuron-powered applications across different AWS services, from container orchestration to high-performance computing clusters. .. grid:: 2 :gutter: 2 .. grid-item-card:: Amazon EKS :link: /devflows/eks-flows :link-type: doc :class-body: sphinx-design-class-title-small Deploy Neuron workloads on Kubernetes with Amazon Elastic Kubernetes Service .. grid-item-card:: Amazon ECS :link: /devflows/ecs-flows :link-type: doc :class-body: sphinx-design-class-title-small Run containerized Neuron applications using Amazon Elastic Container Service .. grid-item-card:: AWS ParallelCluster :link: /devflows/parallelcluster-flows :link-type: doc :class-body: sphinx-design-class-title-small Set up HPC clusters for distributed training and inference workloads ..
grid-item-card:: AWS Batch :link: /devflows/aws-batch-flows :link-type: doc :class-body: sphinx-design-class-title-small Execute batch ML jobs with automatic scaling and resource management .. toctree:: :maxdepth: 1 :hidden: /devflows/eks-flows /devflows/ecs-flows /devflows/parallelcluster-flows /devflows/aws-batch-flows Amazon SageMaker Third-party Solutions ================================================ FILE: devflows/inference/aws-batch-flows.rst ================================================ AWS Batch Flows - Inference =========================== .. include:: /devflows/inference/aws-batch-flows.txt ================================================ FILE: devflows/inference/aws-batch-flows.txt ================================================ .. note:: AWS Batch supports Inf1. An example of how to deploy a model with Neuron using Batch is coming soon. ================================================ FILE: devflows/inference/byoc-hosting-devflow-inf2.rst ================================================ .. _byoc-hosting-devflow-inf2: Bring Your Own Neuron Container to Sagemaker Hosting (inf2 or trn1) ==================================================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/byoc-then-hosting-dev-flow.png :width: 850 :alt: Neuron developer flow on SageMaker Neo :align: middle You can use a SageMaker Notebook or an EC2 instance to compile models and build your own containers for deployment on SageMaker Hosting using ml.inf2 instances. In this developer flow, you provision a SageMaker Notebook or an EC2 instance to train and compile your model for Inferentia. Then you deploy your model to SageMaker Hosting using the `SageMaker Python SDK `_. You may not need to create a container to bring your own **code** to Amazon SageMaker. When you are using a framework such as TensorFlow or PyTorch that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework. Follow the steps below to set up your environment. Once your environment is set up, you'll be able to follow the `Compiling and Deploying HuggingFace Pretrained BERT on Inf2 on Amazon SageMaker Sample `_. .. _byoc-hosting-setenv: Setup Environment ----------------- 1. Create a Compilation Instance: If using an **EC2 instance for compilation only**, you can use any instance type to compile a model. It is recommended that you start with a c5.4xlarge instance. If using an **EC2 instance to compile and test a model**, you can use an Inf2 instance. Follow these steps to launch an Inf2 instance: .. include:: /setup/install-templates/inf2/launch-inf2-dlami.rst If using a **SageMaker Notebook for compilation**, follow the instructions in `Get Started with Notebook Instances `_ to provision the environment. It is recommended that you start with an ml.c5.4xlarge instance for the compilation. Also, increase the volume size of your SageMaker notebook instance to accommodate the models and containers built locally. A volume of 10GB is sufficient. .. note:: To compile the model in the SageMaker Notebook instance, you'll need to install the Neuron Compiler and Neuron Framework Extensions. Follow the `Compiling and Deploying HuggingFace Pretrained BERT on Inf2 on Amazon SageMaker Sample `_ to install the environments. 2. Set up the environment to compile a model, build your own container, and deploy: To compile your model on EC2 or a SageMaker Notebook, follow the *Set up a development environment* section in the EC2 :ref:`ec2-then-ec2-setenv` documentation. Refer to the `Adapting Your Own Inference Container `_ documentation for information on how to bring your own containers to SageMaker Hosting. Make sure to attach the **AmazonEC2ContainerRegistryPowerUser** policy to your IAM role, so you're able to build and push containers from your SageMaker Notebook instance. .. note:: The container image can be created using :ref:`how-to-build-neuron-container`.
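Building and pushing the container from the notebook or EC2 instance follows the standard ECR flow. A hedged sketch only; the region and repository name are placeholders, and it assumes the repository already exists:

.. code-block:: bash

   # Log in to ECR, then build and push your inference container (names are placeholders).
   REGION=us-west-2
   ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
   REPO=${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/my-neuron-inference
   aws ecr get-login-password --region ${REGION} | \
     docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com
   docker build -t ${REPO}:latest .
   docker push ${REPO}:latest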
================================================ FILE: devflows/inference/byoc-hosting-devflow.rst ================================================ .. _byoc-hosting-devflow: Bring Your Own Neuron Container to Sagemaker Hosting (inf1) ============================================================ .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/byoc-then-hosting-dev-flow.png :width: 850 :alt: Neuron developer flow on SageMaker Neo :align: middle You can use a SageMaker Notebook or an EC2 instance to compile models and build your own containers for deployment on SageMaker Hosting using ml.inf1 instances. In this developer flow, you provision a SageMaker Notebook or an EC2 instance to train and compile your model for Inferentia. Then you deploy your model to SageMaker Hosting using the SageMaker Python SDK. Follow the steps below to set up your environment. Once your environment is set up, you'll be able to follow the :ref:`BYOC HuggingFace pretrained BERT container to Sagemaker Tutorial `. .. _byoc-hosting-setenv: Setup Environment ----------------- 1. Create a Compilation Instance: If using an **EC2 instance for compilation**, you can use an Inf1 instance to compile and test a model. Follow these steps to launch an Inf1 instance: .. include:: /setup/install-templates/inf1/launch-inf1-ami.rst If using a **SageMaker Notebook for compilation**, follow the instructions in `Get Started with Notebook Instances `_ to provision the environment. It is recommended that you start with an ml.c5.4xlarge instance for the compilation. Also, increase the volume size of your SageMaker notebook instance to accommodate the models and containers built locally. A volume of 10GB is sufficient. .. note:: To compile the model in the SageMaker Notebook instance, you'll need to update the conda environments to include the Neuron Compiler and Neuron Framework Extensions. Follow the installation guide in the section :ref:`how-to-update-to-latest-Neuron-Conda-Env` to update the environments. 2. Set up the environment to compile a model, build your own container, and deploy: To compile your model on EC2 or a SageMaker Notebook, follow the *Set up a development environment* section in the EC2 :ref:`ec2-then-ec2-setenv` documentation. Refer to the `Adapting Your Own Inference Container `_ documentation for information on how to bring your own containers to SageMaker Hosting. Make sure to attach the **AmazonEC2ContainerRegistryPowerUser** policy to your IAM role, so you're able to build and push containers from your SageMaker Notebook instance. .. note:: The container image can be created using :ref:`how-to-build-neuron-container`. ================================================ FILE: devflows/inference/container-sm-hosting-devflow.rst ================================================ ..
_container-sm-hosting-devflow: Deploy on Sagemaker Hosting =========================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- You can use a `Sagemaker Hosted Endpoint `_ to run inference on Inf1 instances. ================================================ FILE: devflows/inference/dev-flows.rst ================================================ .. _neuron1-devflows: .. _compilation-flow-target: .. _deploym-flow-target: Developer Flows Introduction ============================ |image| .. |image| image:: /images/neuron-devflow.jpg :width: 500 :alt: Neuron developer flow A typical Neuron developer flow includes a compilation phase followed by deployment (inference) on one or more Inf1 instances. You can develop on Neuron using one of the following combinations of developer flows: .. toctree:: :maxdepth: 1 ec2-then-ec2-devflow ec2-then-ec2-devflow-inf2 neo-then-hosting-devflow byoc-hosting-devflow dlc-then-ec2-devflow dlc-then-ecs-devflow dlc-then-eks-devflow ================================================ FILE: devflows/inference/dlc-then-ec2-devflow.rst ================================================ .. _dlc-then-ec2-devflow: Deploy Neuron Container on EC2 ============================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/dlc-on-ec2-dev-flow.png :width: 500 :alt: Neuron developer flow for DLC on EC2 :align: middle You can use the Neuron version of the `AWS Deep Learning Containers `_ to run inference on Inf1 instances. In this developer flow, you provision an EC2 Inf1 instance using a Deep Learning AMI (DLAMI), pull the container image with the Neuron version of the desired framework, and run the container as a server for the already compiled model (a run sketch follows below). This developer flow assumes the model has already been compiled through a :ref:`compilation developer flow `. .. _dlc-then-ec2-setenv: Setup Environment ----------------- 1. Launch an Inf1 Instance .. include:: /setup/install-templates/inf1/launch-inf1-ami.rst 2. Once you have your EC2 environment set up according to :ref:`tutorial-docker-env-setup`, you can build and run a Neuron container using the :ref:`how-to-build-neuron-container` section above. .. [DLC specific flow, uncomment when DLC available] Follow the `Getting Started with Deep Learning Containers for Inference on EC2 `_ and use the appropriate DLC container. .. note:: **Prior to running the container**, make sure that the Neuron runtime on the instance is turned off, by running the command: .. code:: bash

   sudo service neuron-rtd stop
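With neuron-rtd stopped, you can then start the container as a server for the compiled model. This is a hedged sketch only; the image name and serving port are placeholders that depend on the framework DLC you pulled:

.. code-block:: bash

   # Illustrative: run the container as a detached inference server,
   # exposing one Inferentia device. Image and port are placeholders.
   docker run -d --name neuron-serve \
     --device=/dev/neuron0 \
     -p 8500:8500 \
     <your-neuron-dlc-image>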
================================================ FILE: devflows/inference/dlc-then-ecs-devflow.rst ================================================ .. _inference-dlc-then-ecs-devflow: Deploy Neuron Container on Elastic Container Service (ECS) for Inference ======================================================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/dlc-on-ecs-dev-flow.png :width: 750 :alt: Neuron developer flow for DLC on ECS :align: middle You can use the Neuron version of the `AWS Deep Learning Containers `_ to run inference on Amazon Elastic Container Service (ECS). In this developer flow, you set up an ECS cluster with Inf1/Inf2 instances, create a task description for your inference service, and deploy it to your cluster. This developer flow assumes: 1. The model has already been compiled through :ref:`Compilation with Framework API on EC2 instance ` or through :ref:`Compilation with Sagemaker Neo `. 2. You have already set up your container to retrieve it from storage. .. _inference-dlc-then-ecs-setenv: Setup Environment ----------------- 1. Set up an Amazon ECS cluster: Follow the instructions on `Setting up Amazon ECS for Deep Learning Containers `_ 2. Define an Inference Task: Use the instructions in the `DLC Inference on ECS Tutorial `_ to define a task and create a service for the appropriate framework. When creating tasks for Inferentia instances on ECS, be aware of the considerations and requirements listed in `Working with inference workloads on Amazon ECS `_. 3. Use the container image created using :ref:`how-to-build-neuron-container` as the ``image`` in your task definition. .. _inference-push_to_ecr_note: .. note:: Before deploying your task definition to your ECS cluster, make sure to push the image to ECR. Refer to `Pushing a Docker image `_ for more information. ================================================ FILE: devflows/inference/dlc-then-eks-devflow.rst ================================================ .. _dlc-then-eks-devflow: Deploy Neuron Container on Elastic Kubernetes Service (EKS) for Inference ========================================================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/dlc-on-eks-dev-flow.png :width: 750 :alt: Neuron developer flow for DLC on EKS :align: middle You can use the Neuron version of the `AWS Deep Learning Containers `_ to run inference on Amazon Elastic Kubernetes Service (EKS). In this developer flow, you set up an EKS cluster with Inf1 instances, create a Kubernetes manifest for your inference service, and deploy it to your cluster. This developer flow assumes: 1. The model has already been compiled through :ref:`Compilation with Framework API on EC2 instance ` or through :ref:`Compilation with Sagemaker Neo `. 2. You have already set up your container to retrieve it from storage. .. _dlc-then-eks-setenv: Setup Environment ----------------- Add Inferentia nodes using the instructions at :ref:`tutorial-k8s-env-setup-for-neuron`. Using the YML deployment manifest shown `in the EKS documentation for inferentia `_, replace the `image` in the `containers` specification with the one you built using :ref:`how-to-build-neuron-container` (see the sketch below). .. note:: Before deploying the yaml to your EKS cluster, make sure to push the image to ECR. Refer to `Pushing a Docker image `_ for more information. Inference Example ----------------- Please refer to :ref:`example-deploy-rn50-as-k8s-service` to run a simple inference example. Note that the container image referenced in the YML manifest is created using :ref:`how-to-build-neuron-container`.
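The manifest edit described above can also be scripted. A hedged sketch; the manifest file name and image URI are placeholders:

.. code-block:: bash

   # Point the deployment manifest at the image you pushed to ECR, then deploy it.
   IMAGE=<account>.dkr.ecr.<region>.amazonaws.com/neuron-rn50:latest   # placeholder URI
   sed -i "s|image: .*|image: ${IMAGE}|" rn50-deployment.yaml          # placeholder file
   kubectl apply -f rn50-deployment.yaml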
================================================ FILE: devflows/inference/dlc-then-k8s-devflow.rst ================================================ .. _dlc-then-k8s-devflow: Deploy Neuron Container on Kubernetes ====================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- Using Neuron in containers on a Kubernetes cluster is straightforward; follow :ref:`tutorial-k8s-env-setup-for-neuron`. Known Limitations ----------------- Scheduling on a k8s cluster requires contiguous Neuron device IDs. ================================================ FILE: devflows/inference/ec2-flows.rst ================================================ EC2 Flows - Inference ===================== .. toctree:: :maxdepth: 1 :hidden: /devflows/inference/ec2-then-ec2-devflow /devflows/inference/ec2-then-ec2-devflow-inf2 .. include:: /devflows/inference/ec2-flows.txt ================================================ FILE: devflows/inference/ec2-flows.txt ================================================ * :ref:`ec2-then-ec2-devflow` * :ref:`ec2-then-ec2-devflow-inf2` ================================================ FILE: devflows/inference/ec2-then-ec2-devflow-inf2.rst ================================================ .. _ec2-then-ec2-devflow-inf2: Compile with Framework API and Deploy on EC2 Inf2 ================================================= .. contents:: Table of Contents :local: :depth: 3 Description ----------- |image| .. |image| image:: /images/ec2-then-ec2-dev-flow-inf2.png :width: 500 :alt: Neuron developer flow on EC2 :align: middle You can use a single Inf2 instance as a development environment to compile and deploy Neuron models. In this developer flow, you provision an EC2 Inf2 instance using a Deep Learning AMI (DLAMI) and execute the two steps of the development flow on the same instance. The DLAMI comes pre-packaged with the Neuron frameworks, compiler, and required runtimes to complete the flow. Development happens through Jupyter Notebooks or using a secure shell (ssh) connection in a terminal. Follow the steps below to set up your environment. .. note:: **Model compilation can be executed on a non-inf2 instance** for later deployment. Follow the same EC2 Developer Flow Setup using other instance families and leverage `Amazon Simple Storage Service `_ (S3) to share the compiled models between different instances (see the sketch at the end of this page). .. _ec2-then-ec2-setenv: Setup Environment ----------------- 1. Launch an Inf2 Instance ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf2/launch-inf2-dlami.rst 2. Set up a development environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Enable PyTorch-Neuron ~~~~~~~~~~~~~~~~~~~~~ .. include :: /setup/install-templates/inf2/note-setup-libnrt-warning.rst .. include:: /setup/install-templates/inf2/dlami-enable-neuron-pytorch.rst 3. Set up Jupyter notebook ^^^^^^^^^^^^^^^^^^^^^^^^^^ To develop from a Jupyter notebook, see :ref:`setup-jupyter-notebook-steps-troubleshooting`. You can also run a Jupyter notebook as a script: first enable the ML framework Conda or Python environment of your choice, then see :ref:`running-jupyter-notebook-as-script` for instructions.
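As noted above, compiled models can be shared between instances through S3. A minimal sketch, assuming a bucket you own and an illustrative artifact name:

.. code-block:: bash

   # On the compilation instance: upload the compiled model artifact.
   aws s3 cp model_neuron.pt s3://<your-bucket>/models/model_neuron.pt

   # On the deployment Inf2 instance: download it before loading the model.
   aws s3 cp s3://<your-bucket>/models/model_neuron.pt .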
================================================ FILE: devflows/inference/ec2-then-ec2-devflow.rst ================================================ .. _ec2-then-ec2-devflow: Compile with Framework API and Deploy on EC2 Inf1 ================================================== .. contents:: Table of Contents :local: :depth: 3 Description ----------- |image| .. |image| image:: /images/ec2-then-ec2-dev-flow.png :width: 500 :alt: Neuron developer flow on EC2 :align: middle You can use a single Inf1 instance as a development environment to compile and deploy Neuron models. In this developer flow, you provision an EC2 Inf1 instance using a Deep Learning AMI (DLAMI) and execute the two steps of the development flow on the same instance. The DLAMI comes pre-packaged with the Neuron frameworks, compiler, and required runtimes to complete the flow. Development happens through Jupyter Notebooks or using a secure shell (ssh) connection in a terminal. Follow the steps below to set up your environment. .. note:: **Model compilation can be executed on a non-inf1 instance** for later deployment. Follow the same EC2 Developer Flow Setup using other instance families and leverage `Amazon Simple Storage Service `_ (S3) to share the compiled models between different instances. .. _ec2-then-ec2-setenv: Setup Environment ----------------- 1. Launch an Inf1 Instance ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/inf1/launch-inf1-dlami.rst 2. Set up a development environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Enable PyTorch-Neuron ~~~~~~~~~~~~~~~~~~~~~ .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. include:: /setup/install-templates/inf1/dlami-enable-neuron-pytorch.rst Enable TensorFlow-Neuron ~~~~~~~~~~~~~~~~~~~~~~~~~ .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. include:: /setup/install-templates/inf1/dlami-enable-neuron-tensorflow.rst Enable Apache MXNet ~~~~~~~~~~~~~~~~~~~~ .. include :: /setup/install-templates/inf1/note-setup-libnrt-warning.rst .. include:: /setup/install-templates/inf1/dlami-enable-neuron-mxnet.rst 3. Set up Jupyter notebook ^^^^^^^^^^^^^^^^^^^^^^^^^^ To develop from a Jupyter notebook, see :ref:`setup-jupyter-notebook-steps-troubleshooting`. You can also run a Jupyter notebook as a script: first enable the ML framework Conda or Python environment of your choice, then see :ref:`running-jupyter-notebook-as-script` for instructions. ================================================ FILE: devflows/inference/env-setup-text.rst ================================================ A typical Neuron developer flow includes a compilation phase followed by deployment (inference) on one or more Inf1 instances. You can also choose one of the following combinations for compilation and deployment: ================================================ FILE: devflows/inference/neo-then-hosting-devflow.rst ================================================ .. _neo-then-hosting-devflow: Compile with Sagemaker Neo and Deploy on Sagemaker Hosting (inf1) ================================================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/neo-then-hosting-dev-flow.png :width: 700 :alt: Neuron developer flow on SageMaker Neo :align: middle You can use SageMaker Neo to compile models for deployment on SageMaker Hosting using ml.inf1 instances. In this developer flow, you provision a SageMaker Notebook instance to train, compile, and deploy your model using the SageMaker Python SDK. Follow the steps below to set up your environment. .. _neo-then-hosting-setenv: Setup Environment ----------------- 1. Create an Amazon SageMaker Notebook Instance: Follow the instructions in `Get Started with Notebook Instances `_ The Notebook instance created provides the required Python SDK for training, compiling, and deploying models with Amazon SageMaker. 2. Compile a model using the Amazon SageMaker SDK: Refer to `Supported Instance Types and Frameworks `_ for information on the framework versions currently supported by Amazon SageMaker Neo on AWS Inferentia.
More information about compiling and deploying models with Amazon SageMaker Neo can be found on `Use Neo to Compile a Model `_ ================================================ FILE: devflows/inference/parallelcluster-flows.rst ================================================ Parallel Cluster Flows - Inference =================================== .. include:: /devflows/inference/parallelcluster-flows.txt ================================================ FILE: devflows/inference/parallelcluster-flows.txt ================================================ .. note:: AWS ParallelCluster support is coming soon. ================================================ FILE: devflows/inference/sagemaker-flows.rst ================================================ Sagemaker Flows - Inference =========================== .. toctree:: :maxdepth: 1 :hidden: /devflows/inference/byoc-hosting-devflow-inf2 /devflows/inference/byoc-hosting-devflow /devflows/inference/neo-then-hosting-devflow .. include:: /devflows/inference/sagemaker-flows.txt ================================================ FILE: devflows/inference/sagemaker-flows.txt ================================================ * :ref:`byoc-hosting-devflow-inf2` * :ref:`byoc-hosting-devflow` * :ref:`neo-then-hosting-devflow` * `AWS Neuron Sagemaker Samples GitHub Repository `_ ================================================ FILE: devflows/parallelcluster-flows.rst ================================================ AWS ParallelCluster =================== .. toctree:: :maxdepth: 1 /devflows/training/parallelcluster-flows .. .. include:: /devflows/parallelcluster-flows.txt ================================================ FILE: devflows/parallelcluster-flows.txt ================================================ .. tab-set:: .. tab-item:: Training .. include:: /devflows/training/parallelcluster-flows.txt .. tab-set:: .. tab-item:: Inference .. note:: AWS ParallelCluster support is coming soon. ================================================ FILE: devflows/plugins/npd-ecs-flows.rst ================================================ .. _ecs-neuron-problem-detector-and-recovery: Neuron Problem Detector And Recovery ==================================== .. include:: /devflows/plugins/npd-ecs-flows.txt ================================================ FILE: devflows/plugins/npd-ecs-flows.txt ================================================ Neuron node problem detector and recovery artifact checks the health of Neuron devices on each ECS instance. After detecting an unrecoverable Neuron error, it triggers an instance replacement. In order to get started with Neuron node problem detector and recovery, make sure that the following requirements are satisfied: * The Neuron node problem detector and recovery requires Neuron driver 2.15+, and it requires the runtime to be at SDK 2.18 or later. Creating a Task Definition -------------------------- Configuration ~~~~~~~~~~~~~ The task definition includes two containers: - **npd-container**: This container is responsible for enabling Problem detection functionality in the ECS cluster. - **recovery-container**: This container handles recovery operations in case of failures detected by Neuron Problem Detector. The **recovery-container** has an environment variable called ``ENABLE_RECOVERY`` that controls whether recovery is enabled or disabled. Set the value to ``true`` to enable recovery, or ``false`` to disable it. Follow these steps to create a task definition for NPD and recovery: 1. 
Go to the `ECS console `_ and select **Task Definitions** in the navigation pane. 2. Click **Create new Task Definition** and choose **Create new Task Definition with JSON**. 3. Paste the task definition JSON provided, replacing the placeholders with your account-specific values. .. code-block:: json { "family": "neuron-npd-and-recovery", "containerDefinitions": [ { "name": "npd", "image": "registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19", "cpu": 0, "portMappings": [ { "name": "npd-80-tcp", "containerPort": 80, "hostPort": 80, "protocol": "tcp", "appProtocol": "http" } ], "essential": true, "entryPoint": [ "/bin/sh", "-c" ], "command": [ "echo '{\"plugin\":\"kmsg\",\"logPath\":\"/dev/kmsg\",\"lookback\":\"5m\",\"bufferSize\":10,\"source\":\"kernel-monitor\",\"conditions\":[{\"type\":\"NeuronHealth\",\"reason\":\"NeuronHasNoError\",\"message\":\"Neuronhasnoerror\"}],\"rules\":[{\"type\":\"permanent\",\"condition\":\"NeuronHealth\",\"reason\":\"NeuronHasError_SRAM_UNCORRECTABLE_ERROR\",\"pattern\":\".*NEURON_HW_ERR=SRAM_UNCORRECTABLE_ERROR.*\"},{\"type\":\"permanent\",\"condition\":\"NeuronHealth\",\"reason\":\"NeuronHasError_NC_UNCORRECTABLE_ERROR\",\"pattern\":\".*NEURON_HW_ERR=NC_UNCORRECTABLE_ERROR.*\"},{\"type\":\"permanent\",\"condition\":\"NeuronHealth\",\"reason\":\"NeuronHasError_HBM_UNCORRECTABLE_ERROR\",\"pattern\":\".*NEURON_HW_ERR=HBM_UNCORRECTABLE_ERROR.*\"},{\"type\":\"permanent\",\"condition\":\"NeuronHealth\",\"reason\":\"NeuronHasError_DMA_ERROR\",\"pattern\":\".*NEURON_HW_ERR=DMA_ERROR.*\"}]}' > /config/kernel-monitor.json && /node-problem-detector --v=2 --logtostderr --enable-k8s-exporter=false --config.system-log-monitor=/config/kernel-monitor.json" ], "environment": [], "mountPoints": [], "volumesFrom": [], "linuxParameters": { "devices": [ { "hostPath": "/dev/kmsg", "containerPath": "/dev/kmsg", "permissions": [ "read", "write" ] } ] }, "privileged": true, "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/ecs/npd", "awslogs-create-group": "true", "awslogs-region": "us-west-2", "awslogs-stream-prefix": "ecs" }, "secretOptions": [] }, "systemControls": [] }, { "name": "recovery", "image": "public.ecr.aws/neuron/neuron-node-recovery:1.3.0", "cpu": 0, "portMappings": [], "essential": true, "entryPoint": [ "/bin/sh", "-c" ], "command": [ "python scripts/check-health.py" ], "environment": [ { "name": "ENABLE_RECOVERY", "value": "false" } ], "mountPoints": [], "volumesFrom": [], "readonlyRootFilesystem": true, "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-create-group": "true", "awslogs-group": "/ecs/recovery", "awslogs-region": "us-west-2", "awslogs-stream-prefix": "ecs" } }, "systemControls": [] } ], "executionRoleArn": "arn:aws:iam::012345678910:role/ecsTaskExecutionRole", "taskRoleArn": "arn:aws:iam::012345678910:role/ecsTaskExecutionRole", "networkMode": "awsvpc", "requiresCompatibilities": [ "EC2" ], "cpu": "1024", "memory": "3072", "runtimePlatform": { "cpuArchitecture": "X86_64", "operatingSystemFamily": "LINUX" } } 4. Review the task definition and click **Create**. For more details on task definitions, refer to the `AWS documentation `_. .. _deploy-service: Deploying the Service --------------------- After creating the task definition, follow these steps to deploy the service: 1. In the ECS console, select the task definition and click **Deploy** → **Create Service**. 2. Select your ECS cluster, set the launch type to **EC2**, and the service type to **Daemon**. 3. 
Click **Create** to deploy the service. For more details on deploying services, refer to the `AWS documentation `_. Permissions ~~~~~~~~~~~ Ensure the ECS task execution role and task role have permissions to: - Publish metrics to CloudWatch - Read and set health status of EC2 instances in the Auto Scaling group Refer to the `AWS documentation on IAM roles for ECS tasks `_ for more information. When any unrecoverable error occurs, Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. It also reflects in NodeCondition and can be seen with kubectl describe node. ================================================ FILE: devflows/sagemaker-flows.rst ================================================ .. _sagemaker_flow: Amazon SageMaker ================ Amazon SageMaker is a fully managed machine learning (ML) platform that streamlines the end-to-end ML workflow at scale. AWS Neuron integrates with Amazon SageMaker to provide optimized performance for ML workloads on AWS Inferentia and AWS Trainium chips. .. contents:: Table of contents :local: :depth: 1 SageMaker JumpStart """"""""""""""""""" Use `Amazon SageMaker JumpStart `_ to train and deploy models using Neuron. SageMaker JumpStart is an ML hub that accelerates model selection and deployment. It provides support for fine-tuning and deploying popular models such as Meta’s Llama family of models. Users can customize pre-trained models with their data and easily deploy them. SageMaker HyperPod """""""""""""""""" Use `Amazon SageMaker HyperPod `_ to streamline ML infrastructure setup and optimization with AWS Neuron. SageMaker HyperPod leverages pre-configured distributed training libraries to split workloads across numerous AI accelerators, enhancing model performance. HyperPod ensures uninterrupted training through automatic checkpointing, fault detection, and recovery. SageMaker Training """""""""""""""""" `Amazon SageMaker Model Training `_ reduces the time and cost to train and tune ML models at scale without the need to manage infrastructure. SageMaker Inference """"""""""""""""""" With `Amazon SageMaker `_ , you can start getting predictions, or inferences, from your trained ML models. SageMaker provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. ================================================ FILE: devflows/setup/ecs-flows.rst ================================================ ECS Flows - Setup ================= .. toctree:: :maxdepth: 1 :hidden: /devflows/plugins/npd-ecs-flows .. include:: /devflows/setup/ecs-flows.txt ================================================ FILE: devflows/setup/ecs-flows.txt ================================================ * :ref:`ecs-neuron-problem-detector-and-recovery` ================================================ FILE: devflows/setup/eks-flows.rst ================================================ EKS - Setup ===================== .. toctree:: :maxdepth: 1 :hidden: /containers/kubernetes-getting-started .. include:: /devflows/setup/eks-flows.txt ================================================ FILE: devflows/setup/eks-flows.txt ================================================ * :ref:`kubernetes-getting-started` ================================================ FILE: devflows/third-party-solutions.rst ================================================ .. 
_third-party-devflow-solutions: Third-party solutions ===================== AWS Neuron integrates with multiple third-party partner solutions that allow you to run deep learning workloads on Amazon EC2 instances powered by AWS Trainium and AWS Inferentia chips. The following list gives an overview of third-party solutions that work with AWS Neuron. .. contents:: Table of contents :local: :depth: 1 Ray """ Ray, by Anyscale, is an open source AI compute engine at the center of many of the world's most powerful AI platforms. It orchestrates infrastructure for any distributed AI workload, such as data processing, model training, and serving, on any accelerator at any scale. Ray simplifies the complexity of distributed computing, improves efficiency, lowers costs, and accelerates developer productivity. `Ray Train documentation `_ Domino """""" Domino is an open enterprise platform for data science, machine learning, and AI research. It works with an expansive list of industry-leading tools and technologies to enrich data science research, development, and deployment processes. Domino works with a wide range of data sources, languages, IDEs, tools, libraries, and publication targets. `Domino documentation `_ ================================================ FILE: devflows/training/aws-batch-flows.rst ================================================ AWS Batch Flows - Training ========================== .. include:: /devflows/training/aws-batch-flows.txt ================================================ FILE: devflows/training/aws-batch-flows.txt ================================================ * :ref:`batch-training` ================================================ FILE: devflows/training/batch/batch-training.rst ================================================ .. _batch-training: Train your model on AWS Batch ============================= .. contents:: Table of Contents :local: :depth: 3 Description ------------ AWS Batch provides a scalable and cost-effective solution for running batch computing workloads in the AWS Cloud. Integrating Trainium with AWS Batch provides an efficient and cost-effective way of training deep learning models at scale. Once you configure your training job, AWS Batch effectively manages the orchestration, execution, and dynamic scaling of compute resources for your extensive machine learning workloads. To learn more about AWS Batch, see `the AWS Batch documentation `_. How does AWS Batch work with Trainium ------------------------------------- .. image:: /images/batch-setup.png As depicted in the illustration above, the workflow begins by building a ``Docker container image for Trainium`` and pushing it to Amazon Elastic Container Registry (ECR). Following this, you configure your AWS Batch environment with the required capabilities and subsequently submit the training job. Follow the steps below to run your training jobs on ``AWS Batch`` with ``Trainium``. #. **Before you begin, please ensure that you have the following prerequisites completed:** * ``AWS VPC`` with at least one ``Subnet`` and an ``EFA Enabled Security Group`` (learn more about EFA-enabled security groups in `the AWS EFA User Guide `_). Make sure the subnet is private and that the VPC has a NAT gateway to allow internet connectivity for the private subnet. * ``AWS ECR`` repository * ``AWS CLI`` installed and configured with permissions for the above-mentioned AWS resources * ``Docker`` * ``jq`` #.
**Setup to start working with AWS Batch** Connect to your EC2 instance (an ``x86_64``-based Linux instance) and clone the ``aws-neuron-samples`` repo. Once done, navigate to the AWS Batch scripts directory. .. code:: shell

   cd ~/
   git clone https://github.com/aws-neuron/aws-neuron-samples.git
   cd ~/aws-neuron-samples/torch-neuronx/training/aws-batch/all-reduce

#. **Configure resource requirements** Update ``build_configs_and_setup.sh`` with your environment variables. Once done, execute the bash script using the command ``./build_configs_and_setup.sh``. #. **Build the required docker image and publish it to ECR** Run ``./build_docker_image.sh`` to build a Neuron Deep Learning Container image using the latest Neuron packages and push this image to ECR. #. **Prepare the AWS infrastructure required to submit the batch job** Run ``./create_resources.sh`` to create all AWS Batch resources needed for your training workload. Below is a brief description of the AWS Batch components this script creates for you: * ``Placement Group`` enables you to influence the placement of your EC2 (Elastic Compute Cloud) instances within the AWS infrastructure. * ``Launch Template`` allows you to define a set of instance configuration parameters, including the Amazon Machine Image (AMI), instance type, key pair, security groups, and other settings, in a template format. * ``Compute Environment`` specifies the type of compute resources you want to use for your batch jobs. It includes details such as the EC2 instance types, the minimum and maximum number of instances, the VPC configuration, and other settings related to the compute environment. * ``Job Definition`` is a blueprint that specifies how a batch job should be run. It encapsulates information about the job, such as the Docker image to be used, the command to execute within the container, the CPU and memory requirements, job dependencies, and other settings. * ``Job Queue`` acts as a queueing mechanism for managing and scheduling the execution of batch computing workloads. By using job queues, AWS Batch provides a scalable and efficient way to process batch workloads, managing the allocation of resources and ensuring optimal use of compute capacity. #. **Submit the job to AWS Batch** Run ``./submit_job.sh`` to submit a basic all-reduce job in the provisioned AWS Batch environment (a sketch of the underlying CLI call appears at the end of this page). #. **Monitor the AWS Batch job** You can use Amazon CloudWatch Logs to monitor, store, and view all the logs from your AWS Batch jobs. To learn more, please see `the AWS docs on using Batch and EKS with CloudWatch `_. .. note:: * You could run a full model training job using this setup. For example, `this sample `_ runs the Llama2-7B tutorial on AWS Batch using the same setup. * You can further tailor your ``Dockerfile`` to include any additional dependencies specific to your needs. * You have the option to leverage ``trn1n.32xlarge`` instances as an alternative to ``trn1.32xlarge``. To make this transition, you only need to adjust the ``launch template`` and ``job definition`` to accommodate the use of 16 EFA (Elastic Fabric Adapter) devices, whereas the current setup for ``trn1`` employs 8 EFA devices. Please check out `this document `_ to start with ``trn1n.32xlarge`` for multi-node execution.
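For reference, here is a hedged sketch of the CLI call that ``submit_job.sh`` wraps; the queue and job definition names are placeholders created for you by ``create_resources.sh``:

.. code-block:: bash

   # Submit the all-reduce job and check its status (names are placeholders).
   aws batch submit-job \
       --job-name all-reduce-test \
       --job-queue <your-job-queue> \
       --job-definition <your-job-definition>
   aws batch describe-jobs --jobs <job-id>   # logs stream to CloudWatch Logs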
================================================ FILE: devflows/training/dlc-then-ecs-devflow.rst ================================================ .. _training-dlc-then-ecs-devflow: Deploy Neuron Container on Elastic Container Service (ECS) for Training ======================================================================== .. contents:: Table of Contents :local: :depth: 2 Description ----------- |image| .. |image| image:: /images/dlc-on-ecs-dev-flow.png :width: 750 :alt: Neuron developer flow for DLC on ECS :align: middle You can use the Neuron version of the `AWS Deep Learning Containers `_ to run training on Amazon Elastic Container Service (ECS). In this developer flow, you set up an ECS cluster with Trn1 instances, create a task description for your training container, and deploy it to your cluster. This developer flow assumes: 1. The model has already been compiled through :ref:`Compilation with Framework API on EC2 instance ` or through :ref:`Compilation with Sagemaker Neo `. 2. You have already set up your container to retrieve it from storage. .. _training-dlc-then-ecs-setenv: Setup Environment ----------------- 1. Set up an Amazon ECS cluster: Follow the instructions on `Setting up Amazon ECS for Deep Learning Containers `_ 2. Define a Training Task: Use the instructions in the `DLC Training on ECS Tutorial `_ to define a task and create a service for the appropriate framework. When creating tasks for Trn1 instances on ECS, be aware of the considerations and requirements listed in `Working with training workloads on Amazon ECS `_. 3. Use the container image created using :ref:`how-to-build-neuron-container` as the ``image`` in your task definition. .. _training_push_to_ecr_note: .. note:: Before deploying your task definition to your ECS cluster, make sure to push the image to ECR. Refer to `Pushing a Docker image `_ for more information. ================================================ FILE: devflows/training/ec2/ec2-training.rst ================================================ .. _ec2-training: Train your model on EC2 ======================= .. contents:: Table of Contents :local: :depth: 3 Description ----------- |image| .. |image| image:: /images/trn1-on-ec2-dev-flow.png :width: 500 :alt: Neuron developer flow on EC2 :align: middle You can use a single Trn1 instance as a development environment to compile and train Neuron models. In this developer flow, you provision an EC2 Trn1 instance using a Deep Learning AMI (DLAMI) and execute the two steps of the development flow on the same instance. The DLAMI comes pre-packaged with the Neuron frameworks, compiler, and required runtimes to complete the flow. Development happens through Jupyter Notebooks or using a secure shell (ssh) connection in a terminal. Follow the steps below to set up your environment. Setup Environment ----------------- 1. Launch a Trn1 Instance ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. include:: /setup/install-templates/launch-trn1-dlami.rst 2. Set up a development environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Enable PyTorch-Neuron ~~~~~~~~~~~~~~~~~~~~~ .. include:: /frameworks/torch/torch-neuronx/setup/install-templates/pytorch-dev-install.txt 3. Set up Jupyter notebook ^^^^^^^^^^^^^^^^^^^^^^^^^^ To develop from a Jupyter notebook, see :ref:`setup-jupyter-notebook-steps-troubleshooting`. You can also run a Jupyter notebook as a script: first enable the ML framework Conda or Python environment of your choice, then see :ref:`running-jupyter-notebook-as-script` for instructions.
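After launching the instance, you can sanity-check the Neuron hardware before starting development; ``neuron-ls`` ships with the Neuron tools preinstalled on the DLAMI:

.. code-block:: bash

   # Confirm the Neuron devices are visible on the Trn1 instance.
   neuron-ls
   ls /dev/neuron*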
================================================
FILE: devflows/training/ec2-flows.rst
================================================

EC2 Flows - Training
====================

.. toctree::
   :maxdepth: 1
   :hidden:

   /devflows/training/ec2/ec2-training

.. include:: /devflows/training/ec2-flows.txt

================================================
FILE: devflows/training/ec2-flows.txt
================================================

* :ref:`ec2-training`

================================================
FILE: devflows/training/parallelcluster/parallelcluster-training.rst
================================================

.. _parallelcluster-training:

Train your model on ParallelCluster
===================================

.. contents:: Table of Contents
   :local:
   :depth: 3

Description
------------

This document explains how to use AWS ParallelCluster to build an HPC compute environment that uses Trn1 compute nodes to run your distributed ML training job. Once the nodes are launched, we will run a training task to confirm that the nodes are working, and use slurm commands to check the job status. In this tutorial, we will use the AWS ``pcluster`` command with a YAML configuration file to generate the cluster. As an example, we are going to launch multiple trn1.32xlarge nodes in our cluster.

We are going to set up our ParallelCluster infrastructure as below:

.. image:: /images/vpc-setup.png

As shown in the figure above, inside a VPC there are two subnets, one public and one private. The head node resides in the public subnet, while the compute fleet (in this case, trn1 instances) resides in the private subnet. A Network Address Translation (NAT) gateway is also needed so that nodes in the private subnet can connect to clients outside the VPC. In the next section, we describe how to set up all the necessary infrastructure for a trn1 ParallelCluster.

Setup environment
-----------------

1. Install prerequisite infrastructure:

   Follow `these setup `_ instructions to install the VPC and all the necessary components for ParallelCluster.

2. Install AWS ParallelCluster in a virtual environment (recommended):

   Follow the instructions at https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html

3. Create and launch ParallelCluster:

   Follow `these creating cluster `_ instructions to launch ParallelCluster in the VPC.

4. Launch a training job:

   Follow `these running training `_ instructions to submit a model training script as a slurm job.

================================================
FILE: devflows/training/parallelcluster-flows.rst
================================================

Parallel Cluster Flows - Training
=================================

.. toctree::
   :maxdepth: 1
   :hidden:

   /devflows/training/parallelcluster/parallelcluster-training

.. include:: /devflows/training/parallelcluster-flows.txt

================================================
FILE: devflows/training/parallelcluster-flows.txt
================================================

* :ref:`parallelcluster-training`

================================================
FILE: devflows/training/sagemaker-flows.rst
================================================

Sagemaker Flows - Training
==========================

.. toctree::
   :maxdepth: 1
   :hidden:

   /devflows/training/sm-devflow/sm-training-devflow

.. include:: /devflows/training/sagemaker-flows.txt

================================================
FILE: devflows/training/sagemaker-flows.txt
================================================

* :ref:`sm-training-devflow`
* `AWS Neuron Sagemaker Samples GitHub Repository `_

================================================
FILE: devflows/training/sm-devflow/sm-training-devflow.rst
================================================

.. _sm-training-devflow:

Train your model on SageMaker
=============================

.. contents:: Table of Contents
   :local:
   :depth: 3

Description
------------

SageMaker Training helps you manage cloud computing resources in Amazon EC2, data storage services such as S3, EFS, and FSx, and security management services such as IAM and VPC. SageMaker Training provides a complete end-to-end experience for training classical ML and state-of-the-art DL models. You can use SageMaker to train models using Trn1 instances (ml.trn1 instance types).

In this developer flow, you provision a SageMaker Notebook instance or SageMaker Studio to train your model using the `SageMaker Python SDK `_. The Amazon SageMaker Python SDK lets you launch training jobs in just a few lines of code. As shown in the diagram below, Amazon SageMaker launches Trn1 instances and copies both data and code onto the instances. It then runs the training script to generate model artifacts. The trained model artifacts are uploaded to S3, and SageMaker finally terminates the provisioned instances. To speed up the training process for successive runs, you can copy the `Neuron Persistent Cache `_ to S3 so that future training jobs can download and reuse the cached compilation artifacts. (See the `Hugging Face fine tuning BERT base model on Amazon SageMaker Tutorial `_ for an example of how to reuse the compiled cache.)

.. image:: /images/trn1-on-sm-dev-flow.png

Setup environment
-----------------

1. Create an Amazon SageMaker Notebook Instance:

   Follow the instructions in `Get Started with Notebook Instances `_ or `Use Amazon SageMaker Studio Notebooks `_. The Notebook instance provides the required Python SDK for training models with Amazon SageMaker. Please make sure the SageMaker Python SDK version is 2.116.0 or later.

2. Train a model using the Amazon SageMaker SDK:

   Follow the instructions in `Distributed Training with PyTorch Neuron on Trn1 instances `_. You'll be able to follow the `Hugging Face fine tuning BERT base model on Amazon SageMaker Tutorial `_.

.. note::

   SageMaker support for EC2 Trn1 instances is currently available only for the PyTorch Estimator. The HuggingFace Estimator will be available in a future release.
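To make the flow concrete, here is a minimal, hypothetical sketch of launching a Trn1 training job with the SageMaker Python SDK's PyTorch Estimator. The entry point, S3 input, and version strings are placeholders; follow the linked tutorial for tested values.

.. code-block:: python

   import sagemaker
   from sagemaker.pytorch import PyTorch

   session = sagemaker.Session()
   role = sagemaker.get_execution_role()  # run inside a SageMaker notebook

   # Hypothetical configuration; see the linked tutorial for tested
   # framework_version/py_version combinations on Trn1.
   estimator = PyTorch(
       entry_point="train.py",          # your training script (placeholder)
       role=role,
       instance_type="ml.trn1.32xlarge",
       instance_count=1,
       framework_version="1.13.1",      # illustrative version string
       py_version="py38",               # illustrative version string
       sagemaker_session=session,
   )

   estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 input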
================================================
FILE: dlami/index.rst
================================================

.. meta::
   :description: Neuron Deep Learning AMIs (DLAMIs) are pre-configured Amazon Machine Images with the Neuron SDK for easy deployment on AWS Inferentia and Trainium instances.
   :keywords: Neuron DLAMI, Deep Learning AMI, AWS Neuron SDK, Inferentia, Trainium, PyTorch, JAX, TensorFlow, vLLM, SSM Parameters
   :date-modified: 01/22/2026

.. _neuron-dlami-overview:
.. _setup-ubuntu22-multi-framework-dlami:
.. _setup-ubuntu24-multi-framework-dlami:

Neuron DLAMI User Guide
=======================

This guide helps you select, configure, and deploy AWS Neuron Deep Learning AMIs (DLAMIs) for running machine learning workloads on AWS Inferentia and Trainium instances. Learn about the different DLAMI types available, pre-installed virtual environments for popular ML frameworks like PyTorch and JAX, and how to automate DLAMI deployment.

.. contents:: Table of Contents
   :local:
   :depth: 2

What are Neuron DLAMIs?
------------------------

Neuron Deep Learning AMIs (DLAMIs) are pre-configured Amazon Machine Images that provide the easiest way to get started with the AWS Neuron SDK.
Each DLAMI comes with Neuron drivers, frameworks, and libraries pre-installed, enabling you to quickly launch and run deep learning workloads on AWS Inferentia and Trainium instances without manual setup.

Neuron currently supports three types of DLAMIs to meet different deployment needs:

* **Multi-Framework DLAMIs**: Support multiple ML frameworks (PyTorch, JAX, vLLM) with separate virtual environments for each
* **Single Framework DLAMIs**: Optimized for a specific framework version with focused virtual environments
* **Base DLAMIs**: Include only Neuron drivers, EFA, and tools - ideal for containerized applications and custom builds

All Neuron DLAMIs support automated discovery through AWS Systems Manager (SSM) parameters, making them easy to integrate into cloud automation workflows and infrastructure-as-code deployments.

.. note::

   Starting with version 2.26.1, Neuron DLAMIs no longer support ``Inf1`` instance types due to an incompatibility with the Neuron driver. If you'd like to run ``Inf1`` workloads, use previous DLAMIs released up to SDK version 2.26.

----

Neuron Multi Framework DLAMI
----------------------------

Neuron Multi-Framework DLAMIs provide the most comprehensive environment, supporting multiple ML frameworks and libraries in isolated virtual environments. Each DLAMI is pre-installed with Neuron drivers and supports all current Neuron instance types (Inf2, Trn1, Trn1n, Trn2, Trn3). This is the recommended option for teams working with multiple frameworks or exploring different ML libraries.

.. note::

   Starting with version 2.27.1, AL2023 DLAMIs no longer support ``PyTorch 2.9+`` due to an incompatibility with the default glibc installed on AL2023. PyTorch requires glibc 2.35+, and upgrading the version within AL2023 can break other system dependencies. This is the error message: ``ImportError: /lib64/libm.so.6: version `GLIBC_2.35' not found``

   Since the latest vLLM version depends on PyTorch 2.9, we have also removed that environment from the AL2023 DLAMI. As a workaround, use the latest Ubuntu-based AMIs instead.

Multi Framework DLAMIs supported
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Operating System
     - Neuron Instances Supported
     - DLAMI Name
   * - Ubuntu 24.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning AMI Neuron (Ubuntu 24.04)

.. _neuron-dlami-multifw-venvs:

Virtual Environments pre-installed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Neuron Framework/Libraries supported
     - Virtual Environment
   * - PyTorch 2.9 Torch NeuronX, NxD Core (Ubuntu 24.04)
     - /opt/aws_neuronx_venv_pytorch_2_9
   * - PyTorch 2.9 NxD Training, Torch NeuronX (Ubuntu 24.04)
     - /opt/aws_neuronx_venv_pytorch_2_9_nxd_training
   * - PyTorch 2.9 NxD Inference, Torch NeuronX (Ubuntu 24.04)
     - /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference
   * - JAX 0.7 NeuronX (Ubuntu 24.04)
     - /opt/aws_neuronx_venv_jax_0_7
   * - vLLM 0.16.0 NxD Inference, Torch NeuronX (Ubuntu 24.04)
     - /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16

We have included a setup script that installs the required dependencies for the package within the PyTorch 2.9 NxD Training virtual environment. To run this script, activate the virtual environment and run ``setup_nxdt.sh``, which runs :ref:`the setup steps here `.

You can easily get started with the multi-framework DLAMI through the AWS console by following this :doc:`setup guide `.
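Because each framework lives in its own virtual environment, a common stumbling block is importing Neuron packages from the wrong interpreter. Below is a minimal, hypothetical sanity check you can run after activating a venv; the path is taken from the table above.

.. code-block:: python

   # A minimal sanity check, assuming the PyTorch 2.9 venv from the table
   # above was activated: source /opt/aws_neuronx_venv_pytorch_2_9/bin/activate
   import sys

   expected_venv = "/opt/aws_neuronx_venv_pytorch_2_9"
   if not sys.executable.startswith(expected_venv):
       raise SystemExit(f"Wrong interpreter: {sys.executable}. "
                        f"Activate {expected_venv} first.")

   import torch_neuronx  # pre-installed in this venv
   print("torch-neuronx imported successfully")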
If you are looking to use the Neuron DLAMI in your cloud automation flows, Neuron also supports :ref:`SSM parameters ` to easily retrieve the latest DLAMI ID.

----

Neuron Single Framework DLAMI
-----------------------------

Neuron Single Framework DLAMIs are optimized for specific framework versions, providing a streamlined environment when you know exactly which framework you'll be using. Each DLAMI is pre-installed with Neuron drivers and supports all Neuron instance types. These DLAMIs are ideal for production deployments where you want a focused, framework-specific environment.

Single Framework DLAMIs supported
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Framework
     - Operating System
     - Neuron Instances Supported
     - DLAMI Name
   * - PyTorch 2.9
     - Ubuntu 24.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning AMI Neuron PyTorch 2.9 (Ubuntu 24.04)
   * - JAX 0.7
     - Amazon Linux 2023
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning AMI Neuron JAX 0.7 (Amazon Linux 2023)
   * - JAX 0.7
     - Ubuntu 24.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning AMI Neuron JAX 0.7 (Ubuntu 24.04)
   * - vLLM 0.16.0
     - Ubuntu 24.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning AMI Neuron PyTorch Inference vLLM 0.16 (Ubuntu 24.04)

Virtual Environments pre-installed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - DLAMI Name
     - Neuron Libraries supported
     - Virtual Environment
   * - Deep Learning AMI Neuron PyTorch 2.9 (Ubuntu 24.04)
     - PyTorch 2.9 Torch NeuronX, NxD Core
     - /opt/aws_neuronx_venv_pytorch_2_9
   * - Deep Learning AMI Neuron PyTorch 2.9 (Ubuntu 24.04)
     - PyTorch 2.9 NxD Training, Torch NeuronX
     - /opt/aws_neuronx_venv_pytorch_2_9_nxd_training
   * - Deep Learning AMI Neuron PyTorch 2.9 (Ubuntu 24.04)
     - PyTorch 2.9 NxD Inference, Torch NeuronX
     - /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference
   * - Deep Learning AMI Neuron JAX 0.7 (Ubuntu 24.04, Amazon Linux 2023)
     - JAX NeuronX 0.7
     - /opt/aws_neuronx_venv_jax_0_7
   * - Deep Learning AMI Neuron PyTorch Inference vLLM 0.16 (Ubuntu 24.04)
     - vLLM NeuronX 0.16.0
     - /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16

Get started with the single framework DLAMI through the AWS console by following one of the corresponding setup guides. If you want to use the Neuron DLAMI in your cloud automation flows, Neuron also supports :ref:`SSM parameters ` to retrieve the latest DLAMI ID.

----

Neuron Base DLAMI
-----------------

Neuron Base DLAMIs provide a minimal foundation with only the essential components: the Neuron driver, EFA (Elastic Fabric Adapter), and Neuron tools. These DLAMIs are designed for advanced users who want to build custom environments, create containerized applications, or have specific framework version requirements not covered by the pre-configured DLAMIs.

Base DLAMIs supported
^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Operating System
     - Neuron Instances Supported
     - DLAMI Name
   * - Amazon Linux 2023
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning Base Neuron AMI (Amazon Linux 2023)
   * - Ubuntu 24.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning Base Neuron AMI (Ubuntu 24.04)
   * - Ubuntu 22.04
     - Inf2, Trn1, Trn1n, Trn2, Trn3
     - Deep Learning Base Neuron AMI (Ubuntu 22.04)
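Since Base DLAMIs target containerized workloads, a common next step is running a container with the Neuron devices passed through. The following is a minimal, illustrative sketch using the Docker SDK for Python; the image name is a placeholder, and the device path assumes the driver has created ``/dev/neuron0`` on the instance.

.. code-block:: python

   import docker

   client = docker.from_env()

   # Run a container with the first Neuron device passed through.
   # "my-neuron-image:latest" is a placeholder for your own image.
   output = client.containers.run(
       image="my-neuron-image:latest",
       command="neuron-ls",
       devices=["/dev/neuron0:/dev/neuron0:rwm"],  # add more devices as needed
       remove=True,
   )
   print(output.decode())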
.. _ssm-parameter-neuron-dlami:

----

Using SSM Parameters for Cloud Automation
------------------------------------------

Neuron DLAMIs support AWS Systems Manager (SSM) parameters for automated DLAMI discovery and deployment. This enables you to always use the latest Neuron SDK release in your infrastructure-as-code templates, CI/CD pipelines, and auto-scaling configurations without hardcoding AMI IDs.

SSM parameters provide several key benefits:

* **Always up-to-date**: Automatically reference the latest DLAMI with the newest Neuron SDK release
* **Infrastructure-as-code friendly**: Use in CloudFormation, Terraform, and other IaC tools
* **Auto Scaling integration**: Update Auto Scaling groups without modifying launch templates
* **Multi-region support**: Available across all AWS regions where Neuron instances are supported

Currently, SSM parameters support finding the latest DLAMI ID for each DLAMI type. Support for finding DLAMIs for specific Neuron SDK versions will be added in future releases.

Finding a specific DLAMI image ID with the latest Neuron release
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can find the DLAMI that supports the latest Neuron SDK by using SSM ``get-parameter``:

.. code-block::

   aws ssm get-parameter \
       --region us-east-1 \
       --name <dlami-ssm-parameter-prefix>/latest/image_id \
       --query "Parameter.Value" \
       --output text

The SSM parameter prefix for each currently supported DLAMI can be seen below. To discover SSM parameters for older or end-of-life DLAMIs, you can filter by framework, version, or operating system using the path structure ``/aws/service/neuron/dlami/<framework>-<version>/<operating-system>``:

.. code-block::

   # List all Neuron DLAMI SSM parameters
   aws ssm get-parameters-by-path --region us-east-1 --path /aws/service/neuron --recursive

   # Filter by framework (e.g., all PyTorch 2.8 DLAMIs)
   aws ssm get-parameters-by-path --region us-east-1 --path /aws/service/neuron/dlami/pytorch-2.8 --recursive

   # Filter by framework and OS
   aws ssm get-parameters-by-path --region us-east-1 --path /aws/service/neuron/dlami/pytorch-2.8/ubuntu-22.04 --recursive

SSM Parameter Prefix
""""""""""""""""""""

.. list-table::
   :widths: 20 39
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - AMI Name
     - SSM parameter Prefix
   * - Deep Learning AMI Neuron (Ubuntu 24.04)
     - /aws/service/neuron/dlami/multi-framework/ubuntu-24.04
   * - Deep Learning AMI Neuron PyTorch 2.9 (Ubuntu 24.04)
     - /aws/service/neuron/dlami/pytorch-2.9/ubuntu-24.04
   * - Deep Learning AMI Neuron JAX 0.7 (Amazon Linux 2023)
     - /aws/service/neuron/dlami/jax-0.7/amazon-linux-2023
   * - Deep Learning AMI Neuron JAX 0.7 (Ubuntu 24.04)
     - /aws/service/neuron/dlami/jax-0.7/ubuntu-24.04
   * - Deep Learning AMI Neuron PyTorch Inference vLLM 0.16 (Ubuntu 24.04)
     - /aws/service/neuron/dlami/pytorch-inference-vllm-0.16/ubuntu-24.04
   * - Deep Learning Base Neuron AMI (Amazon Linux 2023)
     - /aws/service/neuron/dlami/base/amazon-linux-2023
   * - Deep Learning Base Neuron AMI (Ubuntu 24.04)
     - /aws/service/neuron/dlami/base/ubuntu-24.04
   * - Deep Learning Base Neuron AMI (Ubuntu 22.04)
     - /aws/service/neuron/dlami/base/ubuntu-22.04

For example, to find the latest DLAMI ID for the Multi-Framework DLAMI (Ubuntu 24.04), you can use the following:

.. code-block::

   aws ssm get-parameter \
       --region us-east-1 \
       --name /aws/service/neuron/dlami/multi-framework/ubuntu-24.04/latest/image_id \
       --query "Parameter.Value" \
       --output text

You can find all available parameters supported in Neuron DLAMIs via the CLI:

.. code-block::

   aws ssm get-parameters-by-path \
       --region us-east-1 \
       --path /aws/service/neuron \
       --recursive

You can also view the SSM parameters supported in Neuron through the AWS Parameter Store by selecting the "Neuron" service.

Use SSM Parameter to launch instance directly via CLI
"""""""""""""""""""""""""""""""""""""""""""""""""""""

You can use the AWS CLI to resolve the latest DLAMI ID and launch an instance in a single command. This is particularly useful for scripting and automation workflows. Below is an example of launching an Inf2 instance using the Multi-Framework DLAMI (Ubuntu 24.04):

.. code-block::

   aws ec2 run-instances \
       --region us-east-1 \
       --image-id resolve:ssm:/aws/service/neuron/dlami/multi-framework/ubuntu-24.04/latest/image_id \
       --count 1 \
       --instance-type inf2.48xlarge \
       --key-name <key-pair-name> \
       --security-groups <security-group>
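The same resolution can be scripted from Python. Here is a minimal, hypothetical ``boto3`` sketch that reads the parameter and launches an instance from the resolved AMI ID; the key pair and security group names are placeholders.

.. code-block:: python

   import boto3

   region = "us-east-1"
   prefix = "/aws/service/neuron/dlami/multi-framework/ubuntu-24.04"

   # Resolve the latest DLAMI ID from the SSM parameter.
   ssm = boto3.client("ssm", region_name=region)
   image_id = ssm.get_parameter(Name=f"{prefix}/latest/image_id")["Parameter"]["Value"]

   # Launch one Inf2 instance from the resolved AMI.
   ec2 = boto3.client("ec2", region_name=region)
   ec2.run_instances(
       ImageId=image_id,
       InstanceType="inf2.48xlarge",
       MinCount=1,
       MaxCount=1,
       KeyName="my-key-pair",                 # placeholder
       SecurityGroups=["my-security-group"],  # placeholder
   )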
Use SSM alias in EC2 launch templates
"""""""""""""""""""""""""""""""""""""

SSM parameters can be used directly in EC2 launch templates, enabling your Auto Scaling groups to automatically use the latest AMI IDs without requiring updates to launch templates or creating new versions each time an AMI ID changes. This significantly simplifies AMI lifecycle management in production environments.

For more information, see: https://docs.aws.amazon.com/autoscaling/ec2/userguide/using-systems-manager-parameters.html

----

Other Resources
---------------

Learn more about AWS Deep Learning AMIs and Systems Manager:

* `AWS Deep Learning AMI Developer Guide `_
* `AWS DLAMI Release Notes `_
* `AWS Systems Manager Parameter Store `_
* :doc:`Neuron DLAMI Release Notes `

================================================
FILE: frameworks/index.rst
================================================

.. meta::
   :description: ML Framework support on AWS Neuron SDK - PyTorch and JAX integration for high-performance machine learning on AWS Inferentia and Trainium.
   :date-modified: 2026-03-12
   :keywords: AWS Neuron, machine learning

.. _frameworks-neuron-sdk:

ML framework support on AWS Neuron SDK
=======================================

AWS Neuron provides integration with popular machine learning frameworks, enabling you to accelerate your existing models on AWS Inferentia and Trainium with minimal code changes. Choose from our comprehensive framework support to optimize your inference and training workloads.

Frameworks
-----------

.. grid:: 2
   :gutter: 2

   .. grid-item-card:: PyTorch on AWS Neuron
      :link: torch/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Complete PyTorch integration for both inference and training on all Neuron hardware.

      * **TorchNeuron Native** - Native PyTorch backend with eager execution and ``torch.compile``
      * **PyTorch NeuronX (torch-neuronx)** - ``Inf2``, ``Trn1``, ``Trn2`` (inference & training)
      * See: :doc:`/frameworks/torch/pytorch-native-overview`

   .. grid-item-card:: JAX on AWS Neuron
      :link: jax/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      **Beta release** Experimental JAX support with Neuron Kernel Interface (NKI) integration.

      * **JAX NeuronX** - Neuron hardware support
      * Research and development focus
      * **Status**: Beta - active

.. note::

   Looking for TensorFlow, MXNet, or torch-neuron (Inf1) documentation? These frameworks have been archived. See :doc:`/archive/index` for legacy framework documentation.
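To give a feel for the native PyTorch path mentioned in the card above, here is a minimal, illustrative sketch. It assumes the ``torch.device('neuron')`` style shown in the installation troubleshooting guide later in this document, and is not a tested end-to-end recipe.

.. code-block:: python

   import torch

   # With the native PyTorch backend, NeuronCores are addressed as a
   # torch device (see the migration snippet in the troubleshooting guide).
   device = torch.device("neuron")

   model = torch.nn.Linear(4, 4).to(device)
   x = torch.randn(1, 4, device=device)
   print(model(x))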
Hardware compatibility matrix
-----------------------------

.. list-table::
   :header-rows: 1
   :class: compatibility-matrix

   * - Framework
     - Inf2
     - Trn1/Trn1n
     - Trn2
     - Inference
     - Training
   * - **torch-neuronx**
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - **JAX NeuronX**
     - ✅
     - ✅
     - N/A
     - ✅
     - N/A

================================================
FILE: frameworks/jax/api-reference-guide/index.rst
================================================

.. _jax-neuronx-api-reference-guide:

.. meta::
   :description: API Reference Guide for JAX Neuronx - AWS Neuron SDK documentation
   :keywords: API reference, AWS Neuron, JAX, JAX NeuronX
   :date-modified: 2026-03-13

API Reference Guide for JAX Neuronx
====================================================

.. toctree::
   :maxdepth: 1
   :hidden:

   /frameworks/jax/api-reference-guide/neuron-envvars

* :ref:`jax-neuronx-envvars`

================================================
FILE: frameworks/jax/api-reference-guide/neuron-envvars.rst
================================================

.. _jax-neuronx-envvars:

.. meta::
   :description: JAX NeuronX Environment Variables - AWS Neuron SDK documentation
   :keywords: API reference, AWS Neuron, JAX, JAX NeuronX
   :date-modified: 2026-03-13

JAX NeuronX Environment Variables
======================================

Environment variables allow modifications to JAX NeuronX behavior without requiring code changes to the user script. It is recommended to set them in code or just before invoking the Python process, such as ``NEURON_RT_VISIBLE_CORES=8 python3 <script>``

Get Started with PyTorch Neuron ("torch-neuron") on Ubuntu 20
==============================================================

================================================
FILE: setup/torch-neuron.rst
================================================

.. _setup-torch-neuron:

PyTorch Neuron (``torch-neuron``) Setup
=======================================

.. warning::

   ``torch-neuron`` is for Inf1 instances only (legacy NeuronCore v1). For new projects, use Inf2, Trn1, Trn2, or Trn3 with ``torch-neuronx``. See :doc:`/setup/pytorch/index` for current setup.

   For Inf1 setup instructions, see :doc:`/setup/legacy-inf1/pytorch`.

================================================
FILE: setup/torch-neuronx.rst
================================================

.. _setup-torch-neuronx:

.. meta::
   :description: Install PyTorch NeuronX (torch-neuronx) on AWS Trainium and Inferentia instances using DLAMI, DLC, or manual pip installation
   :keywords: pytorch, neuron, torch-neuronx, installation, setup, trainium, inferentia, trn1, trn2, trn3, inf2, DLAMI, pip
   :date-modified: 2026-03-30

PyTorch Neuron (``torch-neuronx``) Setup
========================================

Install PyTorch with Neuron support for training and inference on Inf2, Trn1, Trn2, and Trn3 instances. Choose from a pre-configured DLAMI, a Docker container, or a manual pip installation. For the full setup guide with all options, see :doc:`Install PyTorch for Neuron `.

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: 🚀 DLAMI Installation
      :link: /setup/pytorch/dlami
      :link-type: doc
      :class-card: sd-border-2

      Pre-configured environment with all dependencies. Recommended for most users.

   .. grid-item-card:: 🚀 Multi-Framework DLAMI
      :link: /setup/multiframework-dlami
      :link-type: doc
      :class-card: sd-border-2

      Pre-configured AMI with PyTorch, JAX, and vLLM virtual environments ready to use.

   .. grid-item-card:: 🐳 Deep Learning Container
      :link: /setup/pytorch/dlc
      :link-type: doc
      :class-card: sd-border-2

      Pre-configured Docker images from AWS ECR for containerized deployments.
   .. grid-item-card:: 🔧 Manual Installation
      :link: /setup/pytorch/manual
      :link-type: doc
      :class-card: sd-border-2

      Install on bare OS AMIs or existing systems with full control over dependencies.

   .. grid-item-card:: Rocky Linux 9
      :link: setup-rocky-linux-9
      :link-type: ref
      :class-card: sd-border-2

      Install on Rocky Linux 9 using the Rocky-9-EC2-Base AMI.

================================================
FILE: setup/troubleshooting.rst
================================================

.. meta::
   :description: Troubleshooting guide for AWS Neuron SDK installation issues
   :keywords: neuron, troubleshooting, installation, errors, debugging
   :content-type: troubleshooting
   :date-modified: 2026-03-03

Installation Troubleshooting
=============================

Common issues and solutions for Neuron SDK installation.

Module Import Errors
--------------------

ModuleNotFoundError: No module named 'torch_neuronx'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: Python cannot find the torch_neuronx module after installation.

**Causes**:

- Virtual environment not activated
- Wrong Python version
- Installation failed silently
- Multiple Python installations

**Solutions**:

1. **Verify virtual environment**:

   .. code-block:: bash

      which python
      # Should show virtual environment path, not system Python

2. **Check Python version**:

   .. code-block:: bash

      python --version
      # Should be 3.10, 3.11, or 3.12

3. **Reinstall torch-neuronx**:

   .. code-block:: bash

      pip install --force-reinstall torch-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com

4. **Verify installation**:

   .. code-block:: bash

      pip list | grep neuron

ImportError: cannot import name 'neuron' from 'torch'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: Import error when trying to use Neuron features.

**Cause**: Using PyTorch/XLA syntax with the Native PyTorch backend.

**Solution**: Update code to use Native PyTorch syntax:

.. code-block:: python

   # Old (PyTorch/XLA)
   import torch_xla.core.xla_model as xm
   device = xm.xla_device()

   # New (Native PyTorch)
   import torch
   device = torch.device('neuron')

See :doc:`/frameworks/torch/index` for the complete migration guide.

Device and Runtime Errors
--------------------------

No Neuron devices found
~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: ``neuron-ls`` shows no devices or returns an error.

**Causes**:

- Wrong instance type
- Neuron driver not loaded
- Runtime not started

**Solutions**:

1. **Verify instance type**:

   .. code-block:: bash

      curl http://169.254.169.254/latest/meta-data/instance-type
      # Should show inf2.*, trn1.*, trn2.*, trn3.*, or inf1.*

2. **Check Neuron driver**:

   .. code-block:: bash

      lsmod | grep neuron
      # Should show neuron driver loaded

3. **Install/reload driver**:

   .. code-block:: bash

      # Ubuntu/Debian
      sudo apt-get install -y aws-neuronx-dkms

      # Amazon Linux
      sudo yum install -y aws-neuronx-dkms

4. **Restart runtime**:

   .. code-block:: bash

      sudo systemctl restart neuron-monitor
      neuron-ls

RuntimeError: Neuron runtime initialization failed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: The runtime fails to initialize when running models.

**Causes**:

- Insufficient permissions
- Runtime version mismatch
- Corrupted runtime state

**Solutions**:

1. **Check runtime status**:

   .. code-block:: bash

      sudo systemctl status neuron-monitor

2. **Verify permissions**:

   .. code-block:: bash

      ls -l /dev/neuron*
      # Should be accessible by current user

3. **Reinstall runtime**:
.. code-block:: bash

      sudo apt-get install --reinstall aws-neuronx-runtime-lib

Version Compatibility Issues
-----------------------------

Compiler version mismatch
~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: Error about an incompatible compiler version.

**Cause**: The neuronx-cc version is incompatible with the framework version.

**Solution**: Install compatible versions:

.. code-block:: bash

   # For PyTorch 2.9
   pip install neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com

See :doc:`/release-notes/index` for the version compatibility matrix.

Package dependency conflicts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: pip reports conflicting dependencies.

**Solution**: Use a fresh virtual environment:

.. code-block:: bash

   python3 -m venv ~/fresh_neuron_venv
   source ~/fresh_neuron_venv/bin/activate
   pip install -U pip

   # Install packages in correct order
   pip install torch==2.9.0
   pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com

Network and Repository Issues
------------------------------

Cannot connect to Neuron repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptoms**: apt-get or pip cannot reach the Neuron repositories.

**Solutions**:

1. **Verify network connectivity**:

   .. code-block:: bash

      curl -I https://apt.repos.neuron.amazonaws.com
      curl -I https://pip.repos.neuron.amazonaws.com

2. **Check proxy settings** (if behind a corporate proxy):

   .. code-block:: bash

      export https_proxy=http://proxy.example.com:8080
      export http_proxy=http://proxy.example.com:8080

3. **Use alternative index URL**:

   .. code-block:: bash

      pip install torch-neuronx --index-url=https://pip.repos.neuron.amazonaws.com

GPG key expired
~~~~~~~~~~~~~~~

**Symptoms**: "EXPKEYSIG" error during apt-get update.

**Solution**:

.. code-block:: bash

   wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
   sudo apt-get update -y

Getting Help
------------

If issues persist:

1. **Check release notes**: :doc:`/release-notes/index`
2. **Review documentation**: :doc:`/frameworks/torch/index`
3. **GitHub Issues**: `aws-neuron/aws-neuron-sdk `_
4. **AWS Support**: Open a support case if you have an AWS Support plan

Diagnostic Information
----------------------

When reporting issues, include:
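In addition to the shell output below, a quick Python-side version dump can help pin down environment mismatches. This is a minimal sketch using the standard library's ``importlib.metadata``; the package names are illustrative ones referenced elsewhere in this guide, and any that are absent are simply reported as not installed.

.. code-block:: python

   from importlib.metadata import version, PackageNotFoundError

   # Report installed versions of Neuron-related packages.
   for pkg in ("torch", "torch-neuronx", "neuronx-cc"):
       try:
           print(pkg, version(pkg))
       except PackageNotFoundError:
           print(pkg, "not installed")

..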
code-block:: bash # System information uname -a cat /etc/os-release # Instance type curl http://169.254.169.254/latest/meta-data/instance-type # Neuron devices neuron-ls # Package versions pip list | grep -E "(torch|neuron)" # Driver status lsmod | grep neuron sudo systemctl status neuron-monitor ================================================ FILE: src/benchmark/helper_scripts/llmperf_dp.patch ================================================ diff --git a/src/llmperf/ray_clients/openai_chat_completions_client.py b/src/llmperf/ray_clients/openai_chat_completions_client.py index f2e0a91..74c4027 100644 --- a/src/llmperf/ray_clients/openai_chat_completions_client.py +++ b/src/llmperf/ray_clients/openai_chat_completions_client.py @@ -1,5 +1,6 @@ import json import os +import random import time from typing import Any, Dict @@ -14,6 +15,9 @@ from llmperf import common_metrics @ray.remote class OpenAIChatCompletionsClient(LLMClient): """Client for OpenAI Chat Completions API.""" + def __init__(self): + self.addr_id = 0 + self.addr_select_strategy = 'round-robin' def llm_request(self, request_config: RequestConfig) -> Dict[str, Any]: prompt = request_config.prompt @@ -50,6 +54,13 @@ class OpenAIChatCompletionsClient(LLMClient): address = os.environ.get("OPENAI_API_BASE") if not address: raise ValueError("the environment variable OPENAI_API_BASE must be set.") + # if several addresses of model server exist, select one for each request (1) randomly or (2) round-robin + address_list = address.split(";") + if self.addr_select_strategy == 'round-robin': + address = address_list[self.addr_id] + self.addr_id = (self.addr_id + 1) % len(address_list) + else: + address = random.choice(address_list) key = os.environ.get("OPENAI_API_KEY") if not key: raise ValueError("the environment variable OPENAI_API_KEY must be set.") ================================================ FILE: src/benchmark/helper_scripts/llmperf_reasoning.patch ================================================ diff --git a/src/llmperf/ray_clients/openai_chat_completions_client.py b/src/llmperf/ray_clients/openai_chat_completions_client.py index aeb5fbf..f1b4473 100644 --- a/src/llmperf/ray_clients/openai_chat_completions_client.py +++ b/src/llmperf/ray_clients/openai_chat_completions_client.py @@ -100,7 +100,7 @@ class OpenAIChatCompletionsClient(LLMClient): raise RuntimeError(data["error"]["message"]) delta = data["choices"][0]["delta"] - if delta.get("content", None): + if delta.get("content", None) or delta.get("reasoning_content", None): if not ttft: ttft = time.monotonic() - start_time # time_to_next_token.append(ttft) @@ -109,7 +109,11 @@ class OpenAIChatCompletionsClient(LLMClient): time.monotonic() - most_recent_received_token_time ) most_recent_received_token_time = time.monotonic() - generated_text += delta["content"] + if "reasoning_content" in delta and delta["reasoning_content"]: + chunk_content = delta["reasoning_content"] + else: + chunk_content = delta["content"] + generated_text += chunk_content total_request_time = time.monotonic() - start_time output_throughput = tokens_received / total_request_time ================================================ FILE: src/benchmark/helper_scripts/neuron_perf.patch ================================================ diff --git a/src/llmperf/ray_clients/openai_chat_completions_client.py b/src/llmperf/ray_clients/openai_chat_completions_client.py index f2e0a91..644d5a6 100644 --- a/src/llmperf/ray_clients/openai_chat_completions_client.py +++ 
b/src/llmperf/ray_clients/openai_chat_completions_client.py @@ -92,7 +92,7 @@ class OpenAIChatCompletionsClient(LLMClient): if delta.get("content", None): if not ttft: ttft = time.monotonic() - start_time - time_to_next_token.append(ttft) + # time_to_next_token.append(ttft) else: time_to_next_token.append( time.monotonic() - most_recent_received_token_time diff --git a/token_benchmark_ray.py b/token_benchmark_ray.py index 63216b1..11e0116 100644 --- a/token_benchmark_ray.py +++ b/token_benchmark_ray.py @@ -32,6 +32,7 @@ def get_token_throughput_latencies( stddev_input_tokens: int, mean_output_tokens: int, stddev_output_tokens: int, + tokenizer: str, additional_sampling_params: Optional[Dict[str, Any]] = None, num_concurrent_requests: int = 1, max_num_completed_requests: int = 500, @@ -60,10 +61,8 @@ def get_token_throughput_latencies( """ random.seed(11111) - tokenizer = LlamaTokenizerFast.from_pretrained( - "hf-internal-testing/llama-tokenizer" - ) - get_token_length = lambda text: len(tokenizer.encode(text)) + hf_tokenizer = LlamaTokenizerFast.from_pretrained(tokenizer) + get_token_length = lambda text: len(hf_tokenizer.encode(text)) if not additional_sampling_params: additional_sampling_params = {} @@ -84,7 +83,7 @@ def get_token_throughput_latencies( prompt_tokens_mean=mean_input_tokens, prompt_tokens_stddev=stddev_input_tokens, expect_output_tokens=num_output_tokens, - tokenizer=tokenizer + tokenizer=hf_tokenizer )) start_time = time.monotonic() pbar = tqdm(total=max_num_completed_requests) @@ -118,7 +117,7 @@ def get_token_throughput_latencies( with completed_requests_lock: if num_completed_requests < max_num_completed_requests: if num_output_tokens: - request_metrics[common_metrics.INTER_TOKEN_LAT] /= request_metrics[common_metrics.NUM_OUTPUT_TOKENS] + request_metrics[common_metrics.INTER_TOKEN_LAT] /= num_output_tokens - 1 else: request_metrics[common_metrics.INTER_TOKEN_LAT] = 0 request_metrics[common_metrics.NUM_OUTPUT_TOKENS] = num_output_tokens @@ -155,7 +154,7 @@ def get_token_throughput_latencies( with completed_requests_lock: if num_completed_requests < max_num_completed_requests: if num_output_tokens: - request_metrics[common_metrics.INTER_TOKEN_LAT] /= num_output_tokens + request_metrics[common_metrics.INTER_TOKEN_LAT] /= num_output_tokens - 1 else: request_metrics[common_metrics.INTER_TOKEN_LAT] = 0 request_metrics[common_metrics.NUM_OUTPUT_TOKENS] = num_output_tokens @@ -292,6 +291,7 @@ def run_token_benchmark( additional_sampling_params: str, results_dir: str, user_metadata: Dict[str, Any], + tokenizer: str, ): """ Args: @@ -327,6 +327,7 @@ def run_token_benchmark( stddev_output_tokens=stddev_output_tokens, num_concurrent_requests=num_concurrent_requests, additional_sampling_params=json.loads(additional_sampling_params), + tokenizer=tokenizer, ) if results_dir: @@ -462,6 +463,11 @@ args.add_argument( "name=foo,bar=1. These will be added to the metadata field of the results. 
" ), ) +args.add_argument( + "--tokenizer", + type=str, + default="hf-internal-testing/llama-tokenizer", +) if __name__ == "__main__": env_vars = dict(os.environ) @@ -488,4 +494,5 @@ if __name__ == "__main__": additional_sampling_params=args.additional_sampling_params, results_dir=args.results_dir, user_metadata=user_metadata, + tokenizer=args.tokenizer, ) ================================================ FILE: src/benchmark/tensorflow/distilbert-base-uncased-finetuned-sst-2-english_benchmark.py ================================================ # Add to these lists or change as needed model_names = ["distilbert-base-uncased-finetuned-sst-2-english"] sequence_lengths = [128] batch_sizes = [128] pipeline_sizes = [1] # Silence an irrelevant warning from transformers library import os os.environ["TOKENIZERS_PARALLELISM"] = "false" import numpy as np import neuronperf as npf import neuronperf.tensorflow from transformers import AutoTokenizer, TFAutoModelForSequenceClassification def get_batch(tokenizer, sequence_length, batch_size): sequence = "I am sorry. I really want to like it, but I just can not stand sushi." paraphrase = tokenizer.encode_plus( sequence, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="np", ) inputs = { "input_ids": np.concatenate([paraphrase["input_ids"]] * batch_size, axis=0), "attention_mask": np.concatenate([paraphrase["attention_mask"]] * batch_size, axis=0), } return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Benchmark print("Benchmarking {}".format(filename)) reports = npf.tensorflow.benchmark(filename, inputs) # View and save results print("======== {} ========".format(filename)) npf.print_reports(reports) npf.write_csv(reports) npf.write_json(reports) ================================================ FILE: src/benchmark/tensorflow/distilbert-base-uncased-finetuned-sst-2-english_compile.py ================================================ # Add to these lists or change as needed model_names = ["distilbert-base-uncased-finetuned-sst-2-english"] sequence_lengths = [128] batch_sizes = [128] pipeline_sizes = [1] # Silence an irrelevant warning from transformers library import os os.environ["TOKENIZERS_PARALLELISM"] = "false" import numpy as np import neuronperf as npf import neuronperf.tensorflow from transformers import AutoTokenizer, TFAutoModelForSequenceClassification def get_batch(tokenizer, sequence_length, batch_size): sequence = "I am sorry. I really want to like it, but I just can not stand sushi." 
paraphrase = tokenizer.encode_plus( sequence, max_length=sequence_length, padding="max_length", truncation=True, return_tensors="np", ) inputs = { "input_ids": np.concatenate([paraphrase["input_ids"]] * batch_size, axis=0), "attention_mask": np.concatenate([paraphrase["attention_mask"]] * batch_size, axis=0), } return inputs if __name__ == "__main__": for model_name in model_names: tokenizer = AutoTokenizer.from_pretrained(model_name) model = TFAutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False) for sequence_length in sequence_lengths: inputs = [ get_batch(tokenizer, sequence_length, batch_size) for batch_size in batch_sizes ] filename = f"{model_name}_sl{sequence_length}.json" # Compile print("Compiling {}".format(filename)) npf.tensorflow.compile( model, inputs, batch_sizes=batch_sizes, pipeline_sizes=pipeline_sizes, filename=filename, model_name=model_name, ) ================================================ FILE: src/examples/mxnet/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)** ================================================ FILE: src/examples/mxnet/data_parallel/benchmark_utils.py ================================================ import math from collections import Counter import numpy as np class Results(): def __init__(self, batch_size, num_cores=1): self.latency_array = [] self.end_times = [] self.start_times = [] self.batch_size = batch_size self.num_cores = num_cores def add_result(self, latency_array, end_times, start_times): self.latency_array.extend(latency_array) self.end_times.extend(end_times) self.start_times.extend(start_times) def report(self, f, window_size=1): assert(len(self.latency_array) != 0) p50_latency = np.percentile(self.latency_array, 50) p90_latency = np.percentile(self.latency_array, 90) p95_latency = np.percentile(self.latency_array, 95) p99_latency = np.percentile(self.latency_array, 99) p100_latency = np.percentile(self.latency_array, 100) def get_bucket(start, end): bucketed_start = math.floor(start / window_size) * window_size bucketed_end = math.ceil(end / window_size) * window_size # The check is to make sure that we ignore timestamps that are larger than the window size if bucketed_end - bucketed_start == window_size: return bucketed_start else: return None # Divide the timestamps into different buckets bucketed_timestamps = [get_bucket(start, end) for start, end in zip(self.start_times, self.end_times)] # Count the values in each bucket counted_buckets = Counter( item for item in bucketed_timestamps if item is not None) # Normalize each bucket bucket_throughputs = [(key, value / window_size) for key, value in sorted(counted_buckets.items())] busy_throughputs = [value for _, value in bucket_throughputs] max_throughput = max(busy_throughputs) * self.batch_size avg_throughput = sum(busy_throughputs) * self.batch_size / len(busy_throughputs) f.write("\n") f.write( "Maximum throughput = {} sentences/sec\n".format(int(max_throughput))) f.write("Average throughput = {} sentences/sec\n".format(int(avg_throughput))) f.write("\n") f.write("Latency Percentiles:\n") f.write("===\n") f.write("P50 = {} milliseconds\n".format(int(1000*p50_latency))) f.write("P90 = {} milliseconds\n".format(int(1000*p90_latency))) f.write("P95 = {} milliseconds\n".format(int(1000*p95_latency))) f.write("P99 = {} milliseconds\n".format(int(1000*p99_latency))) f.write("P100 = {} milliseconds\n".format(int(1000*p100_latency))) f.write("\n") f.write("Sanity test:\n") f.write("===\n") f.write("Processed - num batches {}\n".format(len(self.latency_array))) f.write(" - batch size {}\n".format(self.batch_size)) f.write(" - num cores {}\n".format(self.num_cores)) ================================================ FILE: src/examples/mxnet/data_parallel/data_parallel_tutorial.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Data Parallel Mode with Gluon MXNet\n", "\n", "In this tutorial, you will compile a Gluon BERT model and run in data-parallel mode to completely utilize the NeuronCores. Here you will benchmark a multi-worker setup and compare it with a single worker.\n", "\n", "This tutorial is intended only for MXNet-1.8.\n", "\n", "In this tutorial, we will be using an inf1.2xlarge with the latest AWS Deep Learning AMI (DLAMI). 
The inf1.2xlarge instance has 1 AWS Inferentia Chip with 4 NeuronCores.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up your environment\n", "\n", "To run this tutorial, please make sure you deactivate any existing MXNet conda environments you are already using. Install MXNet 1.8 by following the instructions at [MXNet Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/mxnet-setup/mxnet-install.html#develop-on-aws-ml-accelerator-instance). You would also need to change your kernel to use the correct Python environment set up earlier by clicking Kernel->Change Kernel->Python (Neuron MXNet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install dependencies\n", "\n", "We have to install gluon-nlp to get the BERT model. Run the following command to install:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m pip install gluonnlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compiling BERT Model\n", "\n", "Next, we compile the Gluon BERT model and save it. Once the model is compiled, we use the same model across the entire tutorial.\n", "In this tutorial, we will be using a BERT model with sequence length 32" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import mxnet as mx\n", "import mx_neuron\n", "import gluonnlp as nlp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "BERT_MODEL = 'bert_12_768_12'\n", "BERT_DATA = 'book_corpus_wiki_en_uncased'\n", "batch_size = 1\n", "seq_len = 32\n", "num_cores = 1\n", "dtype = 'float32'\n", "\n", "compiled_model_path = '{}.compiled.{}.{}'.format(BERT_MODEL, batch_size, seq_len)\n", "\n", "model, vocab = nlp.model.get_model(BERT_MODEL,\n", " dataset_name=BERT_DATA,\n", " use_classifier=False,\n", " use_decoder=False, ctx=mx.cpu())\n", " \n", "# Create sample inputs for compilation\n", "words = mx.nd.ones([batch_size, seq_len], name='words', dtype=dtype)\n", "valid_len = mx.nd.ones([batch_size,], name='valid_len', dtype=dtype)\n", "segments = mx.nd.ones([batch_size, seq_len], name='segments', dtype=dtype)\n", "inputs = {'data0': words, 'data1': segments, 'data2': valid_len}\n", "\n", "# Compiler Args ~~ \n", "options = {}\n", "embeddingNames = ['bertmodel0_word_embed_embedding0_fwd', 'bertmodel0_token_type_embed_embedding0_fwd', 'bertencoder0_embedding0']\n", "options.update({'force_incl_node_names': embeddingNames})\n", "options.update({'flags': ['--fp32-cast matmult']}) \n", "\n", "# Compile and save ~~ \n", "model = mx_neuron.compile(model, inputs=inputs, **options)\n", "model.export(compiled_model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Parallel Mode\n", "\n", "Data Parallel Mode is a setup in which you launch multiple copies of the same model, such that each model is running independently of the others. In other words, each model has its own resources to run inference. \n", "\n", "On an inf1.2xlarge instance, we have 4 NeuronCores. Hence, we can launch 4 models such that each model is loaded on a single NeuronCore. This enables us to process 4 requests concurrently without a linear increase in latency. As a result, the throughput of the system increases when compared to a single model inference.
This would also allow us to utilize all the 4 NeuronCores on the instance.\n", "\n", "Run through the next set of cells to see the difference in throughput as we scale from one model to 4 models running in parallel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def get_sample_inputs(batch_size, seq_len):\n", " words = np.ones([batch_size, seq_len], dtype=np.float32)\n", " valid_len = np.ones([batch_size,], dtype=np.float32)\n", " segments = np.ones([batch_size, seq_len], dtype=np.float32)\n", " inputs = {'data0': words, 'data1': segments, 'data2': valid_len}\n", " return inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, for comparison purposes, we run the setup with 1 worker. To do this, we set num_cores=1. This launches only 1 model running on a single NeuronCore. After running the cell below, note down the latency and throughput of the system" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from benchmark_utils import Results\n", "import time\n", "import functools\n", "import os\n", "import numpy as np\n", "import warnings\n", "\n", "num_cores = 1\n", "batch_size=1\n", "\n", "# Each worker process should use one core, hence we set\n", "# os.environ['NEURON_RT_NUM_CORES'] = \"1\"\n", "os.environ[\"NEURON_RT_NUM_CORES\"] = \"1\"\n", "\n", "#Result aggregation class (code in benchmark_utils.py)\n", "results = Results(batch_size, num_cores)\n", "def result_handler(output, start, end):\n", " elapsed = end - start\n", " results.add_result([elapsed], [end], [start])\n", "\n", "inputs = get_sample_inputs(batch_size, seq_len)\n", "parallel_neuron_model = NeuronSimpleDataParallel(compiled_model_path, num_cores, inputs)\n", "\n", "#Starting the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Warm up the cores\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.warmup(inputs)\n", " \n", "# Need to run for high number of iterations to benchmark the models\n", "for _ in range(1000):\n", " parallel_neuron_model.infer(inputs)\n", " # Passing the result_handler as a callback function\n", " parallel_neuron_model.add_result(result_handler)\n", "\n", "# Stop inference \n", "parallel_neuron_model.stop()\n", "# Since we are using a multi-process execution with a shared queue, some inferences\n", "# may still be in execution phase. Hence we need to wait till all the inputs are processed\n", "# add_all_results() will collect all the results of requests which are in this state\n", "parallel_neuron_model.add_all_results(result_handler)\n", "\n", "\n", "with open(\"benchmark.txt\", \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", " for line in f:\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we run the setup with 4 workers. To do this, we set num_cores=4. This launches 4 models, each running on an individual NeuronCore. All the 4 models are running in individual processes; in other words, the models run in parallel. \n", "\n", "To feed the models efficiently, we use a producer-consumer setup, in which all processes running a model act as consumers. All consumers are fed from a shared input queue.\n", "\n", "Now we run the setup below. You may notice that the throughput increases by >2x when compared to a single worker setup."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from benchmark_utils import Results\n", "import time\n", "import functools\n", "import os\n", "import numpy as np\n", "\n", "num_cores = 4\n", "batch_size=1\n", "\n", "os.environ[\"NEURON_RT_NUM_CORES\"] = \"1\"\n", "\n", "#Result aggregation class (code in bert_benchmark_utils.py)\n", "results = Results(batch_size, num_cores)\n", "def result_handler(output, start, end):\n", " elapsed = end - start\n", " results.add_result([elapsed], [end], [start])\n", "\n", "inputs = get_sample_inputs(batch_size, seq_len)\n", "parallel_neuron_model = NeuronSimpleDataParallel(compiled_model_path, num_cores, inputs)\n", "\n", "#Starting the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Warm up the cores\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.warmup(inputs)\n", " \n", "# Need to run for high number of iterations to benchmark the models\n", "for _ in range(5000):\n", " parallel_neuron_model.infer(inputs)\n", " # Passing the result_handler as a callback function\n", " parallel_neuron_model.add_result(result_handler)\n", "\n", "# Stop inference \n", "parallel_neuron_model.stop()\n", "# Since we are using a multi-process execution with a shared queue, some inferences\n", "# may still be in execution phase. Hence we need to wait till all the inputs are processed\n", "# add_all_results() will collect all the results of requests which are in this state\n", "parallel_neuron_model.add_all_results(result_handler)\n", "\n", "\n", "with open(\"benchmark.txt\", \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", " for line in f:\n", " print(line)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/mxnet/data_parallel/parallel.py ================================================ import mxnet as mx import mx_neuron import os from time import time from queue import Queue from multiprocessing import Process, Manager def consumer(model_file, sample_input, input_queue, result_queue): sym, args, aux = mx.model.load_checkpoint(model_file, 0) sample_input = {key: mx.nd.array(v) for key, v in sample_input.items()} args.update(sample_input) model = sym.bind(mx.cpu(), args=args, aux_states=aux, grad_req="null") while True: inputs, input_id = input_queue.get() input_queue.task_done() # Stop execution if stopping condition is recieved if inputs == "stop": break inputs = {key: mx.nd.array(v) for key, v in inputs.items()} start = time() results = model.forward(**inputs) results[0].wait_to_read() # Make the output iterable - if it is not already a tuple or list if not isinstance(results, tuple) or isinstance(results, list): results = [results] end = time() if input_id != -1: result_queue.put((results, start, end, input_id)) class NeuronSimpleDataParallel: def __init__(self, model_file, num_neuron_cores, sample_input): self.num_neuron_cores = num_neuron_cores self.sample_input = sample_input self.model_path = model_file # Create shared input queue and output queue manager = Manager() 
        self.input_queue = manager.Queue(maxsize=num_neuron_cores * 16)
        self.result_queue = manager.Queue(maxsize=num_neuron_cores * 16)
        self.processes = [
            Process(
                target=consumer,
                args=(
                    self.model_path,
                    self.sample_input,
                    self.input_queue,
                    self.result_queue,
                ),
            )
            for _ in range(num_neuron_cores)
        ]
        self.input_id = 0
        self.input_dict = set()

    def start_continuous_inference(self):
        for p in self.processes:
            p.start()

    def warmup(self, batch):
        self.input_queue.put((batch, -1))

    def infer(self, batch):
        self.input_id += 1
        self.input_dict.add(self.input_id)
        self.input_queue.put((batch, self.input_id))

    def stop(self):
        for _ in range(self.num_neuron_cores):
            self.input_queue.put(("stop", -1))

    def add_result(self, callback_fn):
        if not self.result_queue.empty():
            result, start, end, input_id = self.result_queue.get()
            self.input_dict.remove(input_id)
            self.result_queue.task_done()
            callback_fn(result, start, end)

    def add_all_results(self, callback_fn):
        # Drain any requests that are still in flight, then join the workers
        while len(self.input_dict):
            self.add_result(callback_fn)
        for p in self.processes:
            p.join()

================================================
FILE: src/examples/mxnet/mxnet-gluon-tutorial.ipynb
================================================

{ "cells": [ { "cell_type": "markdown", "id": "4dcf9bb1", "metadata": {}, "source": [ "## MXNet 1.8: Getting Started with Gluon Tutorial\n", "\n", "In this tutorial you will compile and deploy resnet-50 using the newly supported MXNet 1.8 and Gluon API on an Inf1 instance. This tutorial is only supported with MXNet 1.8.\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge instance since you will be loading and compiling several large models.\n", "\n", "To run this tutorial, please make sure you deactivate any existing MXNet conda environments you are already using. Install MXNet 1.8 by following the instructions at [MXNet Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/mxnet-setup/mxnet-install.html#install-neuron-mxnet). You would also need to change your kernel to use the correct Python environment set up earlier by clicking Kernel->Change Kernel->Python (Neuron MXNet)" ] }, { "cell_type": "markdown", "id": "83eb578b", "metadata": {}, "source": [ "## Compile\n", "\n", "A trained model must be compiled to an Inferentia target before it can run on Inferentia. In this step we compile a pre-trained ResNet50 and export it as a compiled MXNet checkpoint.\n", "\n", "Compilation will take a few minutes. At the end of compilation, the files resnet-50_compiled-0000.params and resnet-50_compiled-symbol.json will be created in the local directory.\n", "\n", "To check the supported operations for the uncompiled model or information on Neuron subgraphs for the compiled model, please see [Neuron Check Model](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-tools/tutorial-neuron-check-model.html#neuron-check-model)."
] }, { "cell_type": "code", "execution_count": null, "id": "88c41e01", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "import mxnet as mx\n", "import mx_neuron as neuron\n", "import numpy as np\n", "\n", "path='http://data.mxnet.io/models/imagenet/'\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-0000.params')\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-symbol.json')\n", "block = mx.gluon.nn.SymbolBlock.imports('resnet-50-symbol.json',\\\n", "    ['data', 'softmax_label'], 'resnet-50-0000.params', ctx=mx.cpu())\n", "\n", "block.hybridize()\n", "\n", "# Compile for Inferentia using Neuron\n", "inputs = { \"data\" : mx.nd.ones([1,3,224,224], name='data', dtype='float32'), 'softmax_label' : mx.nd.ones([1], name='softmax_label', dtype='float32') }\n", "block = neuron.compile(block, inputs=inputs)\n", "\n", "# Save the compiled model\n", "block.export(\"resnet-50_compiled\", 0, block)" ] }, { "cell_type": "code", "execution_count": null, "id": "6337e0ec", "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "id": "5a9af0c7", "metadata": {}, "source": [ "## Deploy\n", "\n", "Deploy on Inferentia to see inference results such as the following:\n", "```\n", "probability=0.643591, class=n02123045 tabby, tabby cat\n", "probability=0.184392, class=n02123159 tiger cat\n", "probability=0.105063, class=n02124075 Egyptian cat\n", "probability=0.030101, class=n02127052 lynx, catamount\n", "probability=0.016112, class=n02129604 tiger, Panthera tigris\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "960c6aa9", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import mxnet as mx\n", "import mx_neuron as neuron\n", "\n", "path='http://data.mxnet.io/models/imagenet/'\n", "mx.test_utils.download(path+'synset.txt')\n", "\n", "fname = mx.test_utils.download('https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg?raw=true')\n", "img = mx.image.imread(fname)  # read the test image\n", "img = mx.image.imresize(img, 224, 224)  # resize\n", "img = img.transpose((2, 0, 1))  # channel first\n", "img = img.expand_dims(axis=0)  # batchify into (batch, RGB, width, height)\n", "img = img.astype(dtype='float32')\n", "\n", "block = mx.gluon.nn.SymbolBlock.imports('resnet-50_compiled-symbol.json',\\\n", "    ['data', 'softmax_label'], 'resnet-50_compiled-0000.params', ctx=mx.cpu())\n", "softmax = mx.nd.random_normal(shape=(1,))\n", "\n", "with open('synset.txt', 'r') as f:\n", "    labels = [l.rstrip() for l in f]\n", "\n", "out = block(img, softmax).asnumpy()\n", "\n", "prob = np.squeeze(out)\n", "a = np.argsort(prob)[::-1]  # print the top-5\n", "for i in a[0:5]:\n", "    print('probability=%f, class=%s' %(prob[i], labels[i]))" ] }, { "cell_type": "raw", "id": "4f15e776", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 5 }

================================================
FILE: src/examples/mxnet/resnet50/resnet50.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "id": "wrapped-soccer", "metadata": {}, "source": [ "# Running Neuron Apache MXNet ResNet50 on Inferentia" ] }, {
"cell_type": "markdown", "id": "appreciated-daily", "metadata": {}, "source": [ "## Introduction:\n", "In this tutorial we will compile and deploy ResNet50 model for Inferentia.\n", "In this tutorial we provide two main sections:\n", "\n", "1.Compile the ResNet50 model.\n", "\n", "2.Infer the compiled model.\n", "\n", "Before running the following verify this Jupyter notebook is running “conda_aws_neuron_mxnet_p36” kernel. You can select the Kernel from the “Kernel -> Change Kernel” option on the top of this Jupyter notebook page.\n", "Neuron supports Python module, Symbol APIs and the C predict API. The following quick start example uses the Symbol API.\n", "\n", "### Warning\n", "This tutorial was tested on MXNet-1.5\n", "\n", "MXNet-1.5 entered maintenance mode and require Neuron runtime 1.0, please see : [MXNet-1.5 enters maintainence mode](../../../../release-notes/maintenance.html)\n", "\n", "To setup development environment for MXNet-1.5 see installation instructions for Neuron 1.15.1 : [Neuron-1.15.1 MXNet install](../../../../archive/mxnet-neuron/setup/mxnet-install.html)" ] }, { "cell_type": "markdown", "id": "advance-rebound", "metadata": {}, "source": [ "## Compile model on Neuron\n", "The following step will compile the resnet50 model. Compilation will take a few minutes on inf1.6xlarge. At the end of compilation, the files resnet-50_compiled-0000.params and resnet-50_compiled-symbol.json will be created in local directory." ] }, { "cell_type": "code", "execution_count": null, "id": "alpha-publication", "metadata": {}, "outputs": [], "source": [ "import mxnet as mx\n", "import numpy as np\n", "\n", "path='http://data.mxnet.io/models/imagenet/'\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-0000.params')\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-symbol.json')\n", "sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)\n", "\n", "# Compile for Inferentia using Neuron\n", "inputs = { \"data\" : mx.nd.ones([1,3,224,224], name='data', dtype='float32') }\n", "sym, args, aux = mx.contrib.neuron.compile(sym, args, aux, inputs)\n", "\n", "#save compiled model\n", "mx.model.save_checkpoint(\"resnet-50_compiled\", 0, sym, args, aux)" ] }, { "cell_type": "code", "execution_count": null, "id": "technical-reason", "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "id": "meaningful-substance", "metadata": {}, "source": [ "## Deploy on Inferentia\n", "Using same instance to deploy the model. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "cooked-jonathan", "metadata": {}, "outputs": [], "source": [ "import mxnet as mx\n", "import numpy as np\n", "\n", "path='http://data.mxnet.io/models/imagenet/'\n", "mx.test_utils.download(path+'synset.txt')\n", "\n", "fname = mx.test_utils.download('https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg?raw=true')\n", "img = mx.image.imread(fname)# convert into format (batch, RGB, width, height)\n", "img = mx.image.imresize(img, 224, 224) # resize\n", "img = img.transpose((2, 0, 1)) # Channel first\n", "img = img.expand_dims(axis=0) # batchify\n", "img = img.astype(dtype='float32')\n", "\n", "sym, args, aux = mx.model.load_checkpoint('resnet-50_compiled', 0)\n", "softmax = mx.nd.random_normal(shape=(1,))\n", "args['softmax_label'] = softmax\n", "args['data'] = img\n", "\n", "# Inferentia context\n", "ctx = mx.neuron()\n", "\n", "exe = sym.bind(ctx=ctx, args=args, aux_states=aux, grad_req='null')\n", "\n", "with open('synset.txt', 'r') as f:\n", " labels = [l.rstrip() for l in f]\n", "\n", "exe.forward(data=img)\n", "prob = exe.outputs[0].asnumpy()# print the top-5\n", "prob = np.squeeze(prob)\n", "a = np.argsort(prob)[::-1]\n", "for i in a[0:5]:\n", " print('probability=%f, class=%s' %(prob[i], labels[i]))\n", " \n", "# Sample output will look like below:\n", "#probability=0.634792, class=n02123045 tabby, tabby cat\n", "#probability=0.193601, class=n02123159 tiger cat\n", "#probability=0.103627, class=n02124075 Egyptian cat\n", "#probability=0.031604, class=n02127052 lynx, catamount\n", "#probability=0.015892, class=n02129604 tiger, Panthera tigris" ] } ], "metadata": { "kernelspec": { "display_name": "Environment (conda_aws_neuron_mxnet_p36)", "language": "python", "name": "conda_aws_neuron_mxnet_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/mxnet/resnet50_neuroncore_groups.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Neuron Apache MXNet - Configurations for NeuronCore Groups Using Resnet50\n", "\n", "\n", "\n", "## Introduction:\n", "\n", "In this tutorial we will compile and deploy Resnet-50 model in parallel using the concept of NeuronCore Groups on an Inf1 instance. This Jupyter notebook should be run on an instance which is inf1.6xlarge or larger. For simplicity we will run this tutorial on inf1.6xlarge but in real life scenario the compilation should be done on a compute instance and the deployment on inf1 instance to save costs. \n", "\n", "Set environment variable NEURON_RT_NUM_CORES to the total number of Neuron cores that will be utilized. The consecutive NeuronCore groups will be created by Neuron Runtime and place the models to the cores according to the compiled size.\n", "\n", "Note that in order to map a model to a group, the model must be compiled to fit within the group size. To limit the number of NeuronCores during compilation, use compiler_args dictionary with field “–neuroncore-pipeline-cores“ set to the group size. 
For example, if NEURON_RT_NUM_CORES=4 and two models compiled with “--neuroncore-pipeline-cores=3” and “--neuroncore-pipeline-cores=1” were loaded, the first model would occupy NC0-2 and the second model would occupy NC3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "compile_args = {'--neuroncore-pipeline-cores' : 2}\n", "sym, args, auxs = neuron.compile(sym, args, auxs, inputs, **compile_args)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In this tutorial we provide two main sections:\n", "\n", "1. Compile the ResNet50 model for Neuron\n", "\n", "2. Run inference using NeuronCore Groups\n", "\n", "Please use environment `conda_aws_neuron_mxnet_p36`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile model for Neuron\n", "\n", "A model must be compiled for the Inferentia target before it can be used on Inferentia. In the following we compile the model with the flag --neuroncore-pipeline-cores set to 2 and run it. The files resnet-50_compiled-0000.params and resnet-50_compiled-symbol.json will be created in the local directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mxnet as mx\n", "import numpy as np\n", "\n", "import mx_neuron as neuron\n", "\n", "path='http://data.mxnet.io/models/imagenet/'\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-0000.params')\n", "mx.test_utils.download(path+'resnet/50-layers/resnet-50-symbol.json')\n", "sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)\n", "\n", "# Compile for Inferentia using Neuron, fit to NeuronCore group size of 2\n", "inputs = { \"data\" : mx.nd.ones([1,3,224,224], name='data', dtype='float32') }\n", "compile_args = {'--neuroncore-pipeline-cores' : 2}\n", "sym, args, aux = neuron.compile(sym, args, aux, inputs, **compile_args)\n", "\n", "# Save the compiled model\n", "mx.model.save_checkpoint(\"resnet-50_compiled\", 0, sym, args, aux)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference using NeuronCore Groups\n", "\n", "Within the framework, the model can be mapped to specific cores using the ```ctx=mx.neuron(N)``` context, where N specifies the starting NeuronCore index on which to deploy the model.
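As a hypothetical sketch (sym_a/sym_b stand in for the two separately compiled checkpoints from the example above, each with its own args/aux):\n", "\n", "```python\n", "# sym_a compiled with --neuroncore-pipeline-cores=3, sym_b with 1, NEURON_RT_NUM_CORES=4\n", "exe_a = sym_a.bind(ctx=mx.neuron(0), args=args_a, aux_states=aux_a, grad_req='null')  # occupies NC0-2\n", "exe_b = sym_b.bind(ctx=mx.neuron(3), args=args_b, aux_states=aux_b, grad_req='null')  # occupies NC3\n", "```\n", "\n", "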
For more information, see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/flex-eg.html .\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "mx.test_utils.download(path+'synset.txt')\n", "\n", "fname = mx.test_utils.download('https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg?raw=true')\n", "img = mx.image.imread(fname)  # read the test image\n", "img = mx.image.imresize(img, 224, 224)  # resize\n", "img = img.transpose((2, 0, 1))  # channel first\n", "img = img.expand_dims(axis=0)  # batchify into (batch, RGB, width, height)\n", "img = img.astype(dtype='float32')\n", "\n", "sym, args, aux = mx.model.load_checkpoint('resnet-50_compiled', 0)\n", "softmax = mx.nd.random_normal(shape=(1,))\n", "args['softmax_label'] = softmax\n", "args['data'] = img\n", "\n", "# Must be set before the first execution initializes the Neuron runtime\n", "os.environ[\"NEURON_RT_NUM_CORES\"] = '4'\n", "\n", "# Inferentia context - starting at NeuronCore index 1, the 2-core compiled\n", "# model skips NC0 and is placed onto NC1,2\n", "ctx = mx.neuron(1)\n", "\n", "exe = sym.bind(ctx=ctx, args=args, aux_states=aux, grad_req='null')\n", "\n", "with open('synset.txt', 'r') as f:\n", "    labels = [l.rstrip() for l in f]\n", "\n", "exe.forward(data=img)\n", "prob = exe.outputs[0].asnumpy()  # print the top-5\n", "prob = np.squeeze(prob)\n", "a = np.argsort(prob)[::-1]\n", "for i in a[0:5]:\n", "    print('probability=%f, class=%s' %(prob[i], labels[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can experiment with different NeuronCore group combinations and different models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Troubleshooting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If not enough NeuronCores are provided, an error message will be displayed:\n", "\n", "```\n", "mxnet.base.MXNetError: [04:01:39] src/operator/subgraph/neuron/./neuron_util.h:541: Check failed: rsp.status().code() == 0: Failed load model with Neuron-RTD Error.
Neuron-RTD Status Code: 9, details: \"\"\n", "\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Environment (conda_aws_neuron_mxnet_p36)", "language": "python", "name": "conda_aws_neuron_mxnet_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/neuron-monitor/neuron-monitor-grafana.json ================================================ { "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "editable": true, "gnetId": null, "graphTooltip": 0, "id": 2, "iteration": 1605138719380, "links": [], "panels": [ { "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": { "align": null, "filterable": false }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "Value" }, "properties": [ { "id": "custom.width", "value": 163 } ] }, { "matcher": { "id": "byName", "options": "Field" }, "properties": [ { "id": "custom.width", "value": 450 } ] }, { "matcher": { "id": "byName", "options": "ami_id" }, "properties": [ { "id": "custom.width", "value": 217 } ] }, { "matcher": { "id": "byName", "options": "instance_type" }, "properties": [ { "id": "custom.width", "value": 391 } ] }, { "matcher": { "id": "byName", "options": "Prometheus instance" }, "properties": [ { "id": "custom.width", "value": 641 } ] } ] }, "gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 }, "id": 8, "options": { "showHeader": true, "sortBy": [] }, "pluginVersion": "7.2.1", "repeat": null, "targets": [ { "expr": "instance_info", "format": "table", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Instance Info", "transformations": [ { "id": "organize", "options": { "excludeByName": { "Time": true, "Value": true, "__name__": true, "ami_id": false, "instance": true, "job": true }, "indexByName": { "Time": 0, "Value": 7, "__name__": 1, "availability_zone": 8, "instance": 5, "instance_id": 2, "instance_name": 3, "instance_type": 4, "job": 6, "region": 9, "subnet_id": 10 }, "renameByName": { "Value": "", "availability_zone": "Availability Zone", "instance": "", "instance_id": "Instance ID", "instance_name": "Instance Name", "instance_type": "Instance Type", "region": "Region", "subnet_id": "Subnet" } } } ], "type": "table" }, { "datasource": null, "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "super-light-yellow", "value": null } ] } }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 0, "y": 8 }, "id": 36, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "last" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "7.2.1", "targets": [ { "expr": "count(instance_info)\n", "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Instance Count", "type": "stat" }, { "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], 
"thresholds": { "mode": "absolute", "steps": [ { "color": "light-blue", "value": null } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 3, "y": 8 }, "id": 10, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "7.2.1", "targets": [ { "expr": "sum (system_vcpu_count)", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "vCPU Count", "type": "stat" }, { "datasource": null, "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "percentage", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 70 }, { "color": "orange", "value": 80 }, { "color": "semi-dark-red", "value": 90 } ] }, "unit": "percentunit" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 6, "y": 8 }, "id": 20, "options": { "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": true, "showThresholdMarkers": true }, "pluginVersion": "7.2.1", "targets": [ { "expr": "avg(sum by (instance_id) (system_vcpu_usage_ratio))", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "vCPU Utilization", "type": "gauge" }, { "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "percentage", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "orange", "value": 80 }, { "color": "red", "value": 90 } ] }, "unit": "percentunit" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 9, "y": 8 }, "id": 16, "options": { "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": true, "showThresholdMarkers": true }, "pluginVersion": "7.2.1", "targets": [ { "expr": "avg(system_memory_used_bytes / system_memory_total_bytes)", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Host Memory Usage", "type": "gauge" }, { "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "rgb(191, 151, 105)", "value": null } ] } }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 12, "y": 8 }, "id": 12, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "7.2.1", "targets": [ { "expr": "count(neuroncore_utilization_ratio > 0)", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "NeuronCores in Use", "transformations": [], "type": "stat" }, { "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": { "align": null, "filterable": false }, "mappings": [], "thresholds": { "mode": "percentage", "steps": [ { "color": "red", "value": null }, { "color": "orange", "value": 5 }, { "color": "yellow", "value": 20 }, { "color": "green", "value": 35 } ] }, "unit": "percentunit" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 15, "y": 8 }, "id": 4, "interval": "", "options": { "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, 
"showThresholdLabels": true, "showThresholdMarkers": true }, "pluginVersion": "7.2.1", "targets": [ { "expr": "avg(neuroncore_utilization_ratio)", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "NeuronCore Utilization", "type": "gauge" }, { "datasource": "Prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "percentage", "steps": [ { "color": "green", "value": null } ] }, "unit": "cps" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 18, "y": 8 }, "id": 6, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "7.2.1", "targets": [ { "expr": "sum(rate(execution_status_total{status_type=\"completed\"}[1m]))", "hide": false, "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Execution Success Rate", "transformations": [], "type": "stat" }, { "datasource": "Prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 1 } ] }, "unit": "cps" }, "overrides": [] }, "gridPos": { "h": 5, "w": 3, "x": 21, "y": 8 }, "id": 18, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "7.2.1", "targets": [ { "expr": "sum(rate(execution_status_total{status_type!=\"completed\"}[1m]))", "instant": true, "interval": "", "legendFormat": "", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Execution Error Rate", "type": "stat" }, { "aliasColors": { "Inf Error Rate": "semi-dark-red", "Inf Success Rate": "light-green" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": null, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 12, "x": 0, "y": 13 }, "hiddenSeries": false, "id": 32, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(rate(execution_status_total{status_type=\"completed\"}[1m]))", "interval": "", "legendFormat": "Execution Success Rate", "refId": "A" }, { "expr": "sum(rate(execution_status_total{status_type!=\"completed\"}[1m]))", "interval": "", "legendFormat": "Execution Error Rate", "refId": "B" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Execution Status Rates", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:547", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "$$hashKey": "object:548", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, 
{ "aliasColors": { "p0": "dark-green", "p1": "semi-dark-green", "p100": "semi-dark-red", "p25": "light-green", "p50": "super-light-green", "p75": "super-light-red", "p99": "light-red", "{percentile=\"p0\"}": "dark-green", "{percentile=\"p1\"}": "semi-dark-green", "{percentile=\"p100\"}": "dark-red", "{percentile=\"p25\"}": "light-green", "{percentile=\"p50\"}": "super-light-green", "{percentile=\"p75\"}": "light-red", "{percentile=\"p99\"}": "semi-dark-red" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": null, "description": "", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "s" }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 12, "w": 12, "x": 12, "y": 13 }, "hiddenSeries": false, "id": 34, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 1, "points": true, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg by (percentile) (execution_latency_seconds)", "interval": "", "legendFormat": "{{percentile}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Execution Latency", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:61", "format": "s", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "$$hashKey": "object:62", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": null, "fieldConfig": { "defaults": { "custom": {}, "unit": "percentunit" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 0, "y": 25 }, "hiddenSeries": false, "id": 30, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg by (neuroncore) (neuroncore_utilization_ratio)", "interval": "", "legendFormat": "nc{{neuroncore}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "NeuronCore Utilization", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:493", "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": "0", "show": true }, { "$$hashKey": "object:494", "format": "short", "label": null, "logBase": 1, "max": "100", "min": "0", "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "Runtime system CPU usage ": "light-red", "Runtime user CPU usage ": "light-green" }, "bars": false, 
"dashLength": 10, "dashes": false, "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "percentunit" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 8, "y": 25 }, "hiddenSeries": false, "id": 2, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "avg by (usage_type) (neuron_runtime_vcpu_usage_ratio)", "format": "time_series", "instant": false, "interval": "", "legendFormat": "Neuron Runtime {{usage_type}} CPU usage ", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Neuron Runtime vCPU Usage", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:385", "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": "0", "show": true }, { "$$hashKey": "object:386", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "host": "rgb(0, 217, 255)", "neuron_device": "super-light-orange" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": null, "fieldConfig": { "defaults": { "custom": {}, "unit": "bytes" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 16, "y": 25 }, "hiddenSeries": false, "id": 28, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg by (memory_location) (sum by (instance_id, memory_location) (neuron_runtime_memory_used_bytes))", "interval": "", "legendFormat": "{{memory_location}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Neuron Runtime Used Memory", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:439", "format": "bytes", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "$$hashKey": "object:440", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "Memory Usage": "rgb(0, 217, 255)", "NeuronCore Usage": "light-orange", "vCPU Usage": "light-blue" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": null, "fieldConfig": { "defaults": { "custom": {}, "unit": "percentunit" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 0, "y": 37 }, "hiddenSeries": false, "id": 22, "legend": { 
"avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg(system_memory_used_bytes / system_memory_total_bytes)", "instant": false, "interval": "", "legendFormat": "Memory Usage", "refId": "A" }, { "expr": "avg(sum by (instance_id) (system_vcpu_usage_ratio))", "instant": false, "interval": "", "legendFormat": "vCPU Usage", "refId": "B" }, { "expr": "avg(neuroncore_utilization_ratio)", "instant": false, "interval": "", "legendFormat": "NeuronCore Usage", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Host System Utilization", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:664", "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": "0", "show": true }, { "$$hashKey": "object:665", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "system": "light-red", "user": "light-green" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "unit": "percentunit" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 8, "y": 37 }, "hiddenSeries": false, "id": 24, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "avg by (usage_type) (system_vcpu_usage_ratio)", "interval": "", "legendFormat": "{{usage_type}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Host vCPU Usage", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:876", "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": "0", "show": true }, { "$$hashKey": "object:877", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "Memory Usage Bytes": "rgb(223, 180, 0)", "Memory Usage Percent": "rgb(0, 217, 255)" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "Prometheus", "fieldConfig": { "defaults": { "custom": {}, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "Memory Usage Percent" }, "properties": [ { "id": "unit", "value": "percentunit" } ] }, { "matcher": { "id": "byName", "options": "Memory Usage Bytes" }, "properties": [ { "id": "unit", "value": "bytes" } ] } ] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 12, "w": 8, "x": 16, "y": 37 }, "hiddenSeries": false, "id": 26, "legend": { "avg": false, 
"current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.2.1", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [ { "$$hashKey": "object:711" }, { "$$hashKey": "object:931", "alias": "Memory Usage Bytes", "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg(system_memory_used_bytes / system_memory_total_bytes)", "instant": false, "interval": "", "legendFormat": "Memory Usage Percent", "refId": "A" }, { "expr": "avg(system_memory_used_bytes)", "instant": false, "interval": "", "legendFormat": "Memory Usage Bytes", "refId": "B" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Host Memory Usage", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:689", "format": "percentunit", "label": "", "logBase": 1, "max": "1", "min": "0", "show": true }, { "$$hashKey": "object:690", "decimals": null, "format": "bytes", "label": "", "logBase": 1, "max": null, "min": "0", "show": true } ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "5s", "schemaVersion": 26, "style": "dark", "tags": [], "templating": { "list": [ { "datasource": "Prometheus", "filters": [], "hide": 0, "label": "", "name": "Filters", "skipUrlSync": false, "type": "adhoc" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "neuron-monitor", "uid": "EqWNYf5Mz", "version": 68 } ================================================ FILE: src/examples/pytorch/bert_tutorial/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)** ================================================ FILE: src/examples/pytorch/bert_tutorial/THIRD ================================================ ================================================ FILE: src/examples/pytorch/bert_tutorial/THIRD PARTY LICENSE.txt ================================================ ** transformers; version 2.8.0 -- https://github.com/huggingface/transformers Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) 
The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: src/examples/pytorch/bert_tutorial/bert_benchmark_utils.py ================================================ import torch import torch.neuron import os import sys import csv import math from collections import Counter import numpy as np class BertTestDataset(torch.utils.data.Dataset): """Bert test dataset.""" def __init__(self, tsv_file, tokenizer, max_length=128, transform=None): """ Args: csv_file (string): Path to the csv file with annotations. tokenizer (callable = hugging face tokenizer): Takes a string and encodes to standard input tensor set max_length (int): Maximum length that all input tensors will be padded to transform (callable, optional): Optional transform to be applied on a sample. """ with open(tsv_file, "r") as f: reader = csv.reader(f, delimiter="\t", quotechar=None) self.lines = list(reader) self.lines.pop(0) self.tokenizer = tokenizer self.max_length = max_length self.transform = transform def __len__(self): return len(self.lines) def __getitem__(self, idx): if torch.is_tensor(idx): idx = idx.tolist() s1_raw = self.lines[idx][3] if isinstance(s1_raw, bytes): s1_raw = s1_raw.decode("utf-8", "ignore") s2_raw = self.lines[idx][4] if isinstance(s2_raw, bytes): s2_raw = s2_raw.decode("utf-8", "ignore") quality = self.lines[idx][0] encoded = self.tokenizer.encode_plus(s1_raw, s2_raw, add_special_tokens=True, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True) sample = {'encoded': encoded, 'quality': quality} if self.transform: sample = self.transform(sample) return sample class BertResults(): def __init__(self, batch_size, num_cores=1): self.correct_count = 0 self.inference_count = 0 self.latency_array = [] self.end_times = [] self.start_times = [] self.batch_size = batch_size self.num_cores = num_cores def add_result(self, correct_count, inference_count, latency_array, end_times, start_times): self.correct_count += correct_count self.inference_count += inference_count self.latency_array.extend(latency_array) self.end_times.extend(end_times) self.start_times.extend(start_times) def report(self, f, window_size=1): assert(len(self.latency_array) != 0) p50_latency = np.percentile(self.latency_array, 50) p90_latency = np.percentile(self.latency_array, 90) p95_latency = np.percentile(self.latency_array, 95) p99_latency = np.percentile(self.latency_array, 99) p100_latency = np.percentile(self.latency_array, 100) def get_bucket(start, end): bucketed_start = math.floor(start / window_size) * window_size bucketed_end = math.ceil(end / window_size) * window_size # The check is to make sure that we ignore timestamps that are larger than the window size if bucketed_end - 
bucketed_start == window_size: return bucketed_start else: return None # Divide the timestamps into different buckets bucketed_timestamps = [get_bucket(start, end) for start, end in zip(self.start_times, self.end_times)] # Count the values in each bucket counted_buckets = Counter( item for item in bucketed_timestamps if item is not None) # Normalize each bucket bucket_throughputs = [(key, value / window_size) for key, value in sorted(counted_buckets.items())] busy_throughputs = [value for _, value in bucket_throughputs] max_throughput = max(busy_throughputs) * self.batch_size avg_throughput = sum(busy_throughputs) * self.batch_size / len(busy_throughputs) f.write("\n") f.write( "Maximum throughput = {} sentences/sec\n".format(int(max_throughput))) f.write("Average throughput = {} sentences/sec\n".format(int(avg_throughput))) f.write("\n") f.write("Latency Percentiles:\n") f.write("===\n") f.write("P50 = {} milliseconds\n".format(int(1000*p50_latency))) f.write("P90 = {} milliseconds\n".format(int(1000*p90_latency))) f.write("P95 = {} milliseconds\n".format(int(1000*p95_latency))) f.write("P99 = {} milliseconds\n".format(int(1000*p99_latency))) f.write("P100 = {} milliseconds\n".format(int(1000*p100_latency))) f.write("\n") f.write("Accuracy:\n") f.write("===\n") if self.inference_count == 0: self.inference_count = 1 accuracy = float(self.correct_count) / float(self.inference_count) f.write("Accuracy = {}% \n".format(round(100*accuracy, 2))) f.write("\n") f.write("Sanity test:\n") f.write("===\n") f.write("Processed - num batches {}\n".format(len(self.latency_array))) f.write(" - batch size {}\n".format(self.batch_size)) f.write(" - num cores {}\n".format(self.num_cores)) ================================================ FILE: src/examples/pytorch/bert_tutorial/glue_mrpc_dev.tsv ================================================ Quality #1 ID #2 ID #1 String #2 String 1 1355540 1355592 He said the foodservice pie business doesn 't fit the company 's long-term growth strategy . " The foodservice pie business does not fit our long-term growth strategy . 0 2029631 2029565 Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war . His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war . 0 487993 487952 The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat . The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent . 1 1989515 1989458 The AFL-CIO is waiting until October to decide if it will endorse a candidate . The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries . 0 1783137 1782659 No dates have been set for the civil or the criminal trial . No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty . 1 3039165 3039036 Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed . It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status . 0 1490811 1490840 While dioxin levels in the environment were up last year , they have dropped by 75 percent since the 1970s , said Caswell . The Institute said dioxin levels in the environment have fallen by as much as 76 percent since the 1970s . 
1 426112 426210 This integrates with Rational PurifyPlus and allows developers to work in supported versions of Java , Visual C # and Visual Basic .NET. IBM said the Rational products were also integrated with Rational PurifyPlus , which allows developers to work in Java , Visual C # and VisualBasic .Net.
1 1439663 1439808 The top rate will go to 4.45 percent for all residents with taxable incomes above $ 500,000 . For residents with incomes above $ 500,000 , the income-tax rate will increase to 4.45 percent .
1 3147370 3147525 The results appear in the January issue of Cancer , an American Cancer Society journal , being published online today . The results appear in the January issue of Cancer , an American Cancer Society ( news - web sites ) journal , being published online Monday .
1 3300040 3299992 The delegates said raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers . Bin Laden ’ s men pointed out that raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers .
0 524136 524119 " Sanitation is poor ... there could be typhoid and cholera , " he said . " Sanitation is poor , drinking water is generally left behind . . . there could be typhoid and cholera . "
0 969512 969295 The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 . The technology-laced Nasdaq Composite Index was down 25.36 points , or 1.53 percent , at 1,628.26 .
1 1685339 1685429 The only announced Republican to replace Davis is Rep. Darrell Issa of Vista , who has spent $ 1.71 million of his own money to force a recall . So far the only declared major party candidate is Rep. Darrell Issa , a Republican who has spent $ 1.5 million of his own money to fund the recall .
1 1967578 1967664 The decision to issue new guidance has been prompted by intelligence passed to Britain by the FBI in a secret briefing in late July . Scotland Yard 's decision to issue new guidance has been prompted by new intelligence passed to Britain by the FBI in late July .
1 2047034 2046820 Unable to find a home for him , a judge told mental health authorities they needed to find supervised housing and treatment for DeVries somewhere in California . The judge had told the state Department of Mental Health to find supervised housing and treatment for DeVries somewhere in California .
1 2046630 2046644 The decision came a year after Whipple ended federal oversight of the district 's racial balance , facilities , budget , and busing . The decision came a year after Whipple ended federal oversight of school busing as well as the district 's racial balance , facilities and budget .
0 2221603 2221633 In midafternoon trading , the Nasdaq composite index was up 8.34 , or 0.5 percent , to 1,790.47 . The Nasdaq Composite Index .IXIC dipped 8.59 points , or 0.48 percent , to 1,773.54 .
1 129995 129864 Morgan Stanley raised its rating on the beverage maker to " overweight " from " equal-weight " saying in part that pricing power with its bottlers should improve in 2004 . Morgan Stanley raised its rating on the company to " overweight " from " equal-weight , " saying the beverage maker 's pricing power with bottlers should improve in 2004 .
0 919683 919782 The pound also made progress against the dollar , reached fresh three-year highs at $ 1.6789 . The British pound flexed its muscle against the dollar , last up 1 percent at $ 1.6672 .
0 970740 971209 Friday , Stanford ( 47-15 ) blanked the Gamecocks 8-0 . Stanford ( 46-15 ) has a team full of such players this season .
1 2745055 2745022 Last month Intel raised its revenue guidance for the quarter to between $ 7.6 billion and $ 7.8 billion . At the end of the second quarter , Intel initially predicted sales of between $ 6.9 billion and $ 7.5 billion .
0 2199097 2199072 The driver , Eugene Rogers , helped to remove children from the bus , Wood said . At the accident scene , the driver was " covered in blood " but helped to remove children , Wood said .
1 1609290 1609098 ONG KONG , July 9 Tens of thousands of demonstrators gathered tonight before the legislature building here to call for free elections and the resignation of Hong Kong 's leader . Tens of thousands of demonstrators gathered yesterday evening to stand before this city 's legislature building and call for free elections and the resignation of Hong Kong 's leader .
1 1597193 1597119 Saddam loyalists have been blamed for sabotaging the nation 's infrastructure , as well as frequent attacks on U.S. soldiers . Hussein loyalists have been blamed for sabotaging the nation 's infrastructure and attacking US soldiers .
1 2758944 2758975 Its closest living relatives are a family frogs called sooglossidae that are found only in the Seychelles in the Indian Ocean . Its closest relative is found in the Seychelles Archipelago , near Madagascar in the Indian Ocean .
0 2584416 2584653 Cooley said he expects Muhammad will similarly be called as a witness at a pretrial hearing for Malvo . Lee Boyd Malvo will be called as a witness Wednesday in a pretrial hearing for fellow sniper suspect John Allen Muhammad .
1 86007 86373 " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . " " Instead of pursuing the most imminent and real threats - international terrorists - this Bush administration has chosen to settle old scores , " Graham said .
1 1602860 1602844 He said they lied on a sworn affidavit that requires them to list prior marriages . Morgenthau said the women , all U.S. citizens , lied on a sworn affidavit that requires them to list prior marriages .
1 1201306 1201329 The association said 28.2 million DVDs were rented in the week that ended June 15 , compared with 27.3 million VHS cassettes . The Video Software Dealers Association said 28.2 million DVDs were rented out last week , compared to 27.3 million VHS cassettes .
0 461779 461815 With these assets , Funny Cide has a solid chance to become the first Triple Crown winner since Affirmed in 1978 . Funny Cide is looking to become horse racing 's first Triple Crown winner in a generation .
1 1438666 1438643 Intel was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel , " spokesman Chuck Mulloy said . Intel spokesman Chuck Mulloy said the company was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel . "
1 3261484 3261306 Mr Annan also warned the US should not use the war on terror as an excuse to suppress " long-cherished freedoms " . Annan warned that the dangers of extremism after September 11 should not be used as an excuse to suppress " long-cherished " freedoms .
1 1277539 1277527 At community colleges , tuition will jump to $ 2,800 from $ 2,500 . Community college students will see their tuition rise by $ 300 to $ 2,800 or 12 percent .
1 3035788 3035918 He made a point of saying during Tuesdays debate that the Confederate flag was a racist symbol . Though Dean made a point of saying during the debate that the Confederate flag is a racist symbol .
0 132553 132725 Bush wanted " to see an aircraft landing the same way that the pilots saw an aircraft landing , " White House press secretary Ari Fleischer said yesterday . On Tuesday , before Byrd 's speech , Fleischer said Bush wanted ' ' to see an aircraft landing the same way that the pilots saw an aircraft landing .
0 2259788 2259747 On Monday the Palestinian Prime Minister , Mahmoud Abbas , will report to the Palestinian parliament on his Government 's achievements in its first 100 days in office . Palestinian Prime Minister Mahmoud Abbas must defend the record of his first 100 days in office before Parliament today as the death toll in the occupied territories continues to rise .
0 2307064 2307235 The civilian unemployment rate improved marginally last month -- slipping to 6.1 percent -- even as companies slashed payrolls by 93,000 . The civilian unemployment rate improved marginally last month _ sliding down to 6.1 percent _ as companies slashed payrolls by 93,000 amid continuing mixed signals about the nation 's economic health .
1 3046488 3046824 Per-user pricing is $ 29 for Workplace Messaging , $ 89 for Team Collaboration and $ 35 for Collaborative Learning . Workplace Messaging is $ 29 , Workplace Team Collaboration is $ 89 , and Collaborative Learning is $ 35 .
1 86020 86007 " Instead of pursuing the most imminent and real threats – international terrorism – this Bush administration chose to settle old scores , " Mr. Graham said . " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . "
0 1100998 1100441 SARS has killed about 800 people and affected more than 8400 since being detected in China in November . SARS has killed about 800 people and sickened more than 8,400 worldwide , mostly in Asia .
1 2268396 2268480 Authorities had no evidence to suggest the two incidents were connected . There was no immediate evidence that the two incidents were connected , police said .
0 1984039 1983986 " Jeremy 's a good guy , " Barber said , adding : " Jeremy is living the dream life of the New York athlete . He also said Shockey is " living the dream life of a New York athlete .
0 2697659 2697747 Ratliff 's daughters , Margaret and Martha Ratliff , were adopted by Peterson after their mother 's death . Peterson helped raise Ratliff 's two daughters , Margaret and Martha Ratliff , who supported him throughout the trial .
0 2175939 2176090 After losing as much as 84.56 earlier , the Dow Jones industrial average closed up 22.81 , or 0.2 percent , at 9,340.45 . In midday trading , the Dow Jones industrial average lost 68.84 , or 0.7 percent , to 9,248.80 .
1 886618 886456 Rumsfeld , who has been feuding for two years with Army leadership , passed over nine active-duty four-star generals . Rumsfeld has been feuding for a long time with Army leadership , and he passed over nine active-duty four-star generals .
1 588637 588864 Consumers who said jobs are difficult to find jumped from 29.4 to 32.6 , while those claiming work was plentiful slipped from 13 to 12.6 . Consumers who said jobs are difficult to find jumped to 32.6 from 29.4 , while those saying work was plentiful slipped to 12.6 from 13 in April .
0 2252795 2252970 He has no immediate plans for television advertising , believing it is unnecessary this early . A Lieberman aide said there were no immediate plans for television advertising .
1 1756329 1756394 " I think it happened very quickly , " Houston Police Department homicide investigator Phil Yochum said of the crime . " I think it happened very quickly , " said Investigator Phil Yochum of the Houston Police Department 's homicide division .
1 1673112 1673068 United issued a statement saying it will " work professionally and cooperatively with all its unions . " Senior vice president Sara Fields said the airline " will work professionally and cooperatively with all our unions . "
1 2357324 2357271 " But they never climb out of the pot of beer again . " It 's just that they never climb out of the beer again . "
1 780408 780363 Chief financial officer Andy Bryant has said that hike had a greater affect volume than officials expected . Bryant has said that hike had a greater effect on demand than officials expected .
1 821523 821385 Robert Liscouski , the Assistant Secretary of Homeland Security for Infrastructure Protection , will oversee NCSD . NCSD 's chief will be Robert Liscouski , the assistant secretary of Homeland Security for Infrastructure Protection .
1 2304696 2304863 HP 's shipments increased 48 percent year-over-year , compared to an increase of 31 percent for Dell . HPs shipments increased 48 per cent year-on-year , compared to an increase of 31 per cent for Dell .
1 2531749 2531607 Chirac , who can pardon a law-breaker , refused Humbert 's request last year but kept in close touch with the family . Chirac , who has the authority to pardon law-breakers , refused Humbert 's request to be allowed to die last year but kept in close touch with the family .
1 3180014 3179967 The charges allege that he was part of the conspiracy to kill and kidnap persons in a foreign country . The government now charges that Sattar conspired with Rahman to kill and kidnap individuals in foreign countries .
1 726966 726945 In the 2002 study , the margin of error ranged from 1.8 to 4.4 percentage points . It has a margin of error of plus or minus three to four percentage points .
1 2638861 2638982 Mr. Clinton 's national security adviser , Sandy Berger , said that the White House wasn 't informed of the FBI activities . Clinton ’ s national security adviser , Sandy Berger , said in an interview that the White House was not informed of the FBI activities .
1 2495223 2495307 " This decision is clearly incorrect , " FTC Chairman Timothy Muris said in a written statement . The decision is " clearly incorrect , " FTC Chairman Tim Muris said .
1 55187 54831 Prosecutors allege that Nichols and co-conspirator Timothy McVeigh worked together to prepare a bomb that destroyed the Alfred P. Murrah Federal Building . Prosecutors allege that Nichols and coconspirator Timothy McVeigh worked together to prepare a 4,000-pound fuel-and-fertilizer bomb that destroyed the Murrah building .
0 2763381 2763517 Terri Schiavo , 39 , is expected to die sometime in the next two weeks in the Tampa-area hospice where she has spent the past several years . Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler .
1 1990975 1991132 Secretary of State Colin Powell designated the Chechen leader believed responsible for last year 's hostage standoff in a Moscow theater as a threat to U.S. security Friday . U.S. Secretary of State Colin Powell on Friday designated Chechen rebel leader Shamil Basayev a threat to the security of the United States and to U.S. citizens .
1 2204353 2204418 " Today , we are trying to convey this problem to Russian President Vladimir Putin and US President George W Bush . " " Today , we are trying to convey this problem to Russian President Vladimir Putin ( news - web sites ) and President Bush ( news - web sites ) . "
1 60122 60445 That would be a potential setback to Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries . The inquiry may hinder Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries .
1 961836 962243 PeopleSoft also said its board had officially rejected Oracle 's offer . Thursday morning , PeopleSoft 's board rejected the Oracle takeover offer .
0 3140260 3140288 The Dow Jones industrial average ended the day down 10.89 at 9,837.94 , after advancing 111.04 Wednesday . The Dow Jones industrial average fell 10.89 points , or 0.11 percent , to 9,837.94 .
1 1720166 1720115 Cortisol levels in the saliva of day care children were highest and rose most steeply in those judged by day care center personnel to be the shyest . Cortisol levels in the saliva of day-care children were highest and rose most steeply in those whom day-care centre staffed judged to be the shyest .
1 2573262 2573319 " The idea that Tony Abbott is in some way a one-dimensional political head-kicker couldn 't be more wrong , " Mr Howard said . " The idea that Tony Abbott is in some way a one-dimensional political head kicker couldn 't be more wrong . "
0 1353356 1353174 " Biotech products , if anything , may be safer than conventional products because of all the testing , " Fraley said , adding that 18 countries have adopted biotechnology . " Biotech products , if anything , may be safer than conventional products because of all the testing , " said Robert Fraley , Monsanto 's executive vice president .
1 2738677 2738741 The rate of skin cancer has tripled since the 1950s in Norway and Sweden , according to the study . The study also found that skin cancer nearly tripled in Norway and Sweden since the 1950s .
1 1638813 1639087 We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " .
1 1605350 1605425 Trans fat makes up only 1 percent to 3 percent of the total fat Americans consume , compared with 14 percent for saturated fat . Trans fat accounts for 2.5 percent of Americans ' daily calories , compared to 11 percent to 12 percent for saturated fat .
1 2494149 2494073 However , a recent slide in prices and OPEC 's expectations of a surge in oil inventories have compounded its fears about a further softening of the market . A 14 percent slide in crude prices this month and expectations of a build up in oil inventories compounded OPEC 's fears of a further softening of the market .
1 3023029 3023229 Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son . Peterson , 31 , is charged with two counts of first-degree murder in the slayings of his wife , Laci , and their unborn son , Conner .
1 1351550 1351155 Carlson on Tuesday said he would not recuse himself from the case . Service officials said Carlson refused to recuse himself from the case .
1 981185 981234 The program will grow to include ports in Dubai , Turkey and Malaysia , among others . The program will be expanded to include areas of the Middle East such as Dubai , Turkey and Malaysia , Mr. Ridge said .
0 2111629 2111786 McCabe said he was considered a witness , not a suspect . " He is not considered a suspect , " McCabe said .
1 655498 655391 The woman was exposed to the SARS virus while in the hospital but was not a health care worker , said Dr. Colin D ’ Cunha , Ontario ’ s commissioner of public health . The woman was exposed to the SARS virus while in the hospital but was not a health-care worker , said Dr Colin D 'Cunha , Ontario 's commissioner of public health .
1 533823 533909 He added that those " are not solely American principles , nor are they exclusively Western . " " These are not solely American principles nor are they exclusively Western , " Rumsfeld said .
1 581592 581570 " If we don 't march into Tehran , I think we will be in pretty good shape , " he said . " As long as we don 't march on Tehran , I think we are going to be in pretty good shape , " he said .
0 1010655 1010430 On Saturday , a 149mph serve against Agassi equalled Rusedski 's world record . On Saturday , Roddick equalled the world record with a 149 m.p.h. serve in beating Andre Agassi .
1 2241925 2242066 Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new technologies and methods to communicate more quickly and efficiently . Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new ways to communicate .
1 2796978 2797024 " APEC leaders are painfully aware that security and prosperity are inseparable , " Thai Prime Minister Thaksin Shinawatra told business leaders . " APEC leaders are painfully aware that security and prosperity are inseparable , " Thaksin said .
0 101746 101775 Danbury prosecutor Warren Murray could not be reached for comment Monday . Prosecutors could not be reached for comment after the legal papers were obtained late Monday afternoon .
1 327839 327748 Wittig resigned last year after being indicted on federal bank fraud charges involving a real estate loan unrelated to Westar business . Wittig resigned in late November about two weeks after being indicted on bank fraud charges in a real estate case unrelated to the company .
0 2988297 2988555 Shattered Glass , " starring Hayden Christensen as Stephen Glass , debuted well with $ 80,000 in eight theaters . " Shattered Glass " _ starring Hayden Christensen as Stephen Glass , The New Republic journalist fired for fabricating stories _ debuted well with $ 80,000 in eight theaters .
1 2217613 2217659 He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston . He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .
0 2128530 2128455 However , EPA officials would not confirm the 20 percent figure . Only in the past few weeks have officials settled on the 20 percent figure .
1 2208376 2208198 University of Michigan President Mary Sue Coleman said in a statement on the university 's Web site , " Our fundamental values haven 't changed . " Our fundamental values haven 't changed , " Mary Sue Coleman , president of the university , said in a statement in Ann Arbor .
1 1980654 1980641 The first products are likely to be dongles costing between US $ 100 and US $ 150 that will establish connections between consumer electronics devices and PCs . The first products will likely be dongles costing $ 100 to $ 150 that will establish connections between consumer electronics devices and PCs .
0 589579 589557 However , Lapidus expects foreign brands ' sales to be up 4 percent , driven by strong truck sales at Honda Motor Co . Lapidus expects Ford to be down 5 percent , Chrysler down 10 percent and foreign brands up 4 percent driven by strong truck sales at Honda .
1 1636060 1635946 Michel , who remains in the government , denied that US pressure had provoked the government 's move . Michel , who has stayed in the new government , denied that it was U.S. pressure which had provoked the government 's move .
1 1630585 1630657 Some of the computers also are used to send spam e-mail messages to drum up traffic to the sites . Some are also used to send spam e-mail messages to boost traffic to the sites .
0 447728 447699 Indonesia 's army has often been accused of human rights abuses during GAM 's battle for independence , charges it has generally denied while accusing the separatists of committing rights violations . Indonesia 's army has been accused of human rights abuses during its earlier battles with GAM , charges it has generally denied .
1 1606495 1606619 Bush also hoped to polish his anti-AIDS credentials in Uganda , which has been hailed as an African pioneer in fighting the killer disease . President Bush flies to Uganda Friday hoping to polish his anti- AIDS credentials in a country hailed as an African pioneer in fighting the epidemic .
1 1550897 1550977 Later this year , the command will send trainers with soldiers from four North African nations on patrolling and intelligence gathering missions . This fall the command will send trainers to work with soldiers from four North African nations on patrolling and gathering intelligence .
0 490376 490490 The reports helped overcome investor jitters after the euro briefly hit an all-time high against the dollar Tuesday . Stocks slipped at the open after the euro hit record highs against the dollar .
1 3084554 3084612 Sales for the quarter beat expectations , rising 37 percent year-on-year to 1.76 billion euros . Sales rose 37 per cent year-on-year to 1.76bn , beating expectations .
1 315647 315778 If the MTA 's appeal to a higher court is successful , the $ 2 bus and subway base fare won 't be rolled back . If the MTA 's appeal is successful , the $ 2 bus and subway base fare won 't change .
1 3428298 3428362 Robert Walsh , 40 , remained in critical but stable condition Friday at Staten Island University Hospital 's north campus . Walsh , also 40 , was in critical but stable condition at Staten Island University Hospital last night .
1 2523564 2523358 The Guru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS ( Basic Input Output System ) update and a troubleshooting-assistance feature called Black Box . The µGuru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS update and a troubleshooting-assistance feature called Black Box .
1 2079200 2079131 U.S. corporate bond yield spreads tightened in spotty trading on Friday as Wall Street labored to get back on its feet after the largest power outage ever in North America . U.S. stocks rose slightly on feather-light volume on Friday , as Wall Street regrouped after the biggest-ever power outage in North America .
1 818091 817811 The company said it would issue revised guidance for the full fiscal year next month when it releases its Q2 results . The company said it would renew its guidance for 2003 when it announces its second quarter results in mid-July .
1 1580638 1580663 " I stand 100 percent by it , and I think our intelligence services gave us the correct information at the time . " I stand 100 percent by it , and I think that our intelligence services gave us the correct intelligence and information at the time , " Blair said .
0 1919740 1919926 " I don 't know if the person I 'm talking to now may end up being someone else at another time that may not follow the rules , " Parrish said . " I don 't know whether the person I 'm talking to now may end up being someone else , " Parrish said .
1 2748287 2748550 " I think it 's going to be a close vote , but I think the grant proposal is going to win , " McConnell said . " I think it 's going to be a close vote , but I think the grant proposal 's going to win , " said Sen. Mitch McConnell , assistant majority leader .
1 3394891 3394775 Twenty-eight people were believed to have been spending Christmas Day with the caretaker of the St Sophia 's camp , when the mudslide smashed into two cabins . Twenty-seven people were believed to have been spending Christmas Day with the caretaker of Saint Sophia Camp , a Greek Orthodox facility , when the mudslide roared through .
0 2963943 2963880 One , Capt. Doug McDonald , remained hospitalized in critical condition on Thursday . Her 20-year-old sister , Allyson , was severely burned and remained hospitalized in critical condition .
0 1865364 1865251 The United States finally relented during President Bush 's visit to Africa earlier this month . During President Bush 's trip to Africa earlier this month , however , Washington said it would support the increase .
1 263690 263819 " There is no conscious policy of the United States , I can assure you of this , to move the dollar at all , " he said . He also said there is no conscious policy by the United States to move the value of the dollar .
1 283751 283290 It 's the first such drill since the September 11 terrorist attacks on New York and Washington . It is the nation 's first large-scale counterterrorism exercise since the Sept . 11 terrorist attacks .
1 2517014 2516995 Myanmar 's pro-democracy leader Aung San Suu Kyi will return home late Friday but will remain in detention after recovering from surgery at a Yangon hospital , her personal physician said . Myanmar 's pro-democracy leader Aung San Suu Kyi will be kept under house arrest following her release from a hospital where she underwent surgery , her personal physician said Friday .
1 1330643 1330622 According to the Merchant Marine Ministry , the 37-year-old ship is registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands . The Baltic Sky is a 37-year-old ship registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands .
1 3111452 3111428 In an unusual move , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages that critics contend could disrupt millions of Web sites . In an unusual move that critics contend could disrupt millions of Web sites , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages .
0 1167835 1167651 Kansas Department of Health and Environment records show there were 88 abortions performed on girls age 14 and younger last year . Statistics from the Kansas Department of Health and Environment show that 11,844 abortions were performed in the state last year .
0 1423836 1423708 A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter . Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter .
1 2090911 2091154 Waiting crowds filling the streets on both sides overwhelmed the peacekeepers soon after daylight , sweeping past the barbed wire barricades . But waiting crowds filling the streets rushed the bridges soon after daylight , overrunning razor-wire barricades .
1 2265271 2265152 Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products not sold in the United States . Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products unknown to the American market .
1 3062202 3062308 By skirting the FDA 's oversight , Eagan said , the quality of the imported drugs is " less predictable " than for those obtained in the United States . By skirting the FDA 's oversight , Eagan said the quality of the imported drugs is " less predictable " than U.S. drugs .
1 2155514 2155377 He said : " For the first time there is an easy and affordable way of making this treasure trove of BBC content available to all . " " For the first time , there is an easy and affordable way of making this treasure trove of BBC content available to all , " Dyke said .
1 1552068 1551928 Three such vigilante-style attacks forced the hacker organizer , who identified himself only as " Eleonora [ 67 ] , " to extend the contest until 7 p.m. EST Sunday . Three such vigilante-style attacks forced the hacker organiser , who identified himself only as " Eleonora67 ] , " to extend the contest until 8am ( AEST ) today .
1 936978 937500 Eric Gagne pitched a perfect ninth for his 23rd save in as many opportunities . Gagne struck out two in a perfect ninth inning for his 23rd save .
0 985015 984975 One way or another , Harry Potter And The Order Of The Phoenix will be in your hands by Saturday . Just about everything about " Harry Potter and the Order of the Phoenix " will set records .
1 1430357 1430425 " Allison just proves you don 't need to wait until August or September to have a disaster , " said Josh Lichter , a meteorologist with the Houston-Galveston weather office . " Allison just proves you don 't need to wait until August or September to have a disaster , " Lichter said .
1 3039310 3039413 Today , analysts say , UN members can no longer ignore the shifts since the September 11 2001 attacks . On Wednesday , analysts say , UN members can no longer ignore the shifts since the attacks in the US of September 11 2001 .
1 34513 34742 Police say CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the United States . Mr McKinlay said that CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the US .
1 368067 368018 Chiron already has nearly 20 percent acceptances from PowderJect 's shareholders . Chiron has acceptances from holders of nearly 20 percent of PowderJect shares .
0 611663 611716 Ernst & Young has denied any wrongdoing and plans to fight the allegations . Ernst & Young has denied the SEC 's claims , and called its recommendations " irresponsible " .
1 98432 98657 The attack followed several days of disturbances in the city where American soldiers exchanged fire with an unknown number of attackers as civilians carried out demonstrations against the American presence . The attack came after several days of disturbance in the city in which U.S. soldiers exchanged fire with an unknown number of attackers as civilians protested the American presence .
1 3039007 3038845 No company employee has received an individual target letter at this time . She said no company official had received " an individual target letter at this time . "
1 1708040 1708062 Second-quarter results reflected a gain of 10 cents per diluted share , while the 2002 results included a loss of 19 cents per diluted share . The second-quarter results had a non-operating gain of 10 cents a share while the 2002 second-quarter performance had a net non-operating loss of 19 cents a share .
0 1757264 1757375 He allegedly told his ex-wife in an angry phone call that he had no intention of following their new custody agreement . The two had battled over custody and he allegedly told her in an angry phone call that he had no intention of following their new custody agreement .
1 383417 383558 Worldwide , more than 50 million people have seen " Les Miz , " with gross receipts of $ 1.8 billion . Worldwide , Les Misérables has been seen by over 50 million people , with a total gross of over $ 2 billion .
0 2766112 2766084 In fiction : Edward P. Jones ( " The Known World " ) and Scott Spencer ( " A Ship Made of Paper " ) . The fifth nominee for fiction is Scott Spencer , for A Ship Made of Paper .
1 1261116 1261234 " Overwhelmingly the Windows brand really resonated with them . " " Windows was the part of the experience that really resonated with people . "
1 3028143 3028234 The Centers for Medicare and Medicaid Services , the federal agency that runs Medicare , last year began a similar effort for nursing homes . The Centers for Medicare and Medicaid launched a similar consumer tool for nursing homes last year .
0 249699 249623 Vivace was founded in 1999 and has raised over $ 118 million in three rounds of venture financing . During difficult times for technology venture capital , Vivace raised over $ 118 million in three rounds of venture financing .
0 3448488 3448449 The Dow Jones industrial average < .DJI > added 28 points , or 0.27 percent , at 10,557 , hitting its highest level in 21 months . The Dow Jones industrial average < .DJI > rose 49 points , or 0.47 percent , to 10,578 .
1 2749322 2749663 The Democratic candidates also began announcing their fund-raising totals before Wednesday 's deadline to file quarterly reports with the Federal Election Commission . The Democratic candidates also began announcing their fund-raising totals in advance of the deadline today to file quarterly reports with the Federal Election Commission .
0 2204592 2204588 Sun Microsystems Inc. on Thursday said it had added 100 new third-party systems and 100 new components to its Hardware Compatibility List for the Solaris x86 operating system Platform Edition . The vendor has added 100 new third-party systems and 100 new components to the operating system 's Hardware Compatibility List ( HCL ) .
1 2889005 2888954 Prosecutors said PW Marketing violated the state 's 1998 anti-spam law by sending unsolicited e-mail without a toll-free number for recipients to call to stop additional mailings . Prosecutors said PW Marketing violated the 1998 anti-spam law because these unsolicited e-mails were sent without a free call number for recipients to phone to stop additional mailings .
0 1657632 1657619 The Neighbours star and singer spent yesterday resting at her family home in Sydney and will have more tests today . Goodrem spent yesterday resting in her family home in Sydney and will have more tests today to determine her exact treatment .
0 555617 555528 The 3 rd Armored Cavalry Regiment is 5,200 strong and the largest combat unit at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment .
1 2396937 2396818 " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the Fed said in a statement accompanying the unanimous decision . " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the policy-setting Federal Open Market Committee said .
0 2339738 2339771 " It is bad for Symbian , " said Per Lindberg , analyst at Dresdner Kleinwort Wasserstein . " Motorola has displayed clear disloyalty " to Symbian , said Per Lindberg , an analyst at Dresdner Kleinwort Wasserstein in London .
0 1616174 1616206 Bob Richter , a spokesman for House Speaker Tom Craddick , had no comment about the ruling . Bob Richter , spokesman for Craddick , R-Midland , said the speaker had not seen the ruling and could not comment .
1 635783 635802 But Ms Ward said the headroom under its financial covenants was " tight " and that there could be another downgrade if Southcorp breached any of its banking covenants . But Ms Ward said the headroom under its financial covenants was " tight " and that there could be a rating downgrade if Southcorp did breach any banking covenants .
1 3444633 3444733 He added : ``I 've never heard of more reprehensiblebehaviour by a doctor . The Harrisons ’ lawyer Paul LiCalsi said : “ I ’ ve never heard of more reprehensible behaviour by a doctor .
1 555553 555528 Broomhead was assigned to 2nd Squadron , 3rd Armor Cavalry Regiment , based at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment .
1 1112021 1111925 Other staff members , however , defended the document , saying it would still help policy-makers and the agency improve efforts to address the climate issue . Some E.P.A. staff members defended the document , saying that although pared down it would still help policy makers and the agency address the climate issue .
0 2749410 2749625 President Bush raised a record-breaking $ 49.5 million for his re-election campaign over the last three months , with contributions from 262,000 Americans , the president 's campaign chairman said Tuesday . President Bush has raised $ 83.9 million since beginning his re-election campaign in May , and has $ 70 million of that left to spend , his campaign said Tuesday .
1 1629064 1629043 An episode is declared when the ozone reaches .20 parts per million parts of air for one hour . A Stage 1 episode is declared when ozone levels reach 0.20 parts per million .
1 789691 789665 " He may not have been there , " the defence official said on Thursday . " He may not have been there , " said a defence official speaking on condition of anonymity .
1 844421 844679 The U.N. troops are in Congo to protect U.N. installations and personnel , and they can only fire in self defense and have been unable to stem the violence . The troops - whose mandate is to protect U.N. installations and personnel - can only fire in self-defense and have been unable to stem the violence .
1 58540 58567 North American markets grabbed early gains Monday morning , as earnings season begins to slow and economic indicators take the spotlight . North American futures pointed to a strong start to the first trading session of the week Monday , as earnings season slows and economic indicators take the spotlight .
1 781439 781461 Xerox itself paid a $ 10 million fine last year to settle similar SEC charges . Xerox itself previously paid a $ 10-million penalty to settle the SEC accusations .
1 1909579 1909408 " This deal makes sense for both companies , " said National Chief Executive Brian Halla . " This deal makes sense for both companies , " Halla said in a prepared statement .
0 787432 787464 The blasts killed two people and injured more than 150 others . The Atlanta Olympic Games attack killed one woman and injured more than 100 other people .
0 52758 52343 Morrill 's wife , Ellie , sobbed and hugged Bondeson 's sister-in-law during the service . At the service Morrill 's widow , Ellie , sobbed and hugged Bondeson 's sister-in-law as people consoled her .
1 1675025 1675047 Spansion products are to be available from both AMD and Fujitsu , AMD said . Spansion Flash memory solutions are available worldwide from AMD and Fujitsu .
1 2131318 2131372 About 1,500 police will be deployed for the visit . Around 1,500 police are to be deployed at Niigata for the ferry 's visit .
1 325763 325928 Gamarekian told The News she remembers only the woman 's first name - and refused to reveal it . She told the New York Daily News she remembers only the intern 's first name , which she refused to reveal .
1 2638975 2638855 One of the FBI ’ s key operatives , who had a falling out with the bureau , provided an account of the operation at a friend ’ s closed immigration court proceeding . One of the FBI 's key operatives , who has had a falling-out with the bureau , provided an account of the operation at a friend 's closed immigration court proceeding .
1 2198694 2198937 A nationally board certified teacher with a master 's degree , Kelley makes a salary of $ 65,000 in his 30th year . A nationally board certified teacher with a master 's degree , Kelley , in his 30th year teaching , makes $ 65,000 .
1 1825432 1825301 A man arrested for allegedly threatening to shoot and kill a city councilman from Queens was ordered held on $ 100,000 bail during an early morning court appearance Saturday . The Queens man arrested for allegedly threatening to shoot City Councilman Hiram Monserrate was held on $ 100,000 bail Saturday , a spokesman for the Queens district attorney said .
1 2906104 2906322 They were being held Sunday in the Camden County Jail on $ 100,000 bail . They remained in Camden County Jail on Sunday on $ 100,000 bail .
1 722278 722383 Ms Stewart , the chief executive , was not expected to attend . Ms Stewart , 61 , its chief executive officer and chairwoman , did not attend .
0 101747 101777 Christina 's aunt , Shelley Riling , said the defense 's claims were preposterous . Christina 's aunt , Shelley Riling , said she will address the court .
1 2224884 2224819 The Justice Department Aug. 19 gave pre-clearance for the Oct. 7 date for the election to recall Gov. Gray Davis , saying it would not affect minority voting rights . The Justice Department on Aug. 19 sanctioned the Oct. 7 date for recall election , saying it would not affect voting rights .
0 977938 978162 Lord Falconer hailed the changes as " a new beginning as far as the courts , Crown Prosecution Service and police are concerned " . " It 's a new beginning as far as the courts , Crown Prosecution Service and police are concerned , making the criminal justice system work better . "
0 1015010 1014963 GE stock closed at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange .
1 1513190 1513246 At least 27 US troops have been killed in hostile fire since Bush 's statement . At least 26 American troops have been killed in hostile fire since major combat was officially declared over on May 1 .
1 2385348 2385394 A recent poll showed Edwards with a narrow lead in South Carolina , and he plans a rally there later on Tuesday . A recent poll showed Edwards in a virtual four-way tie at the top in South Carolina , and he plans a rally there later on Tuesday .
1 2317018 2317252 November 17 's last victim was British defence attache Stephen Saunders , who was shot on an Athens road in June 2000 . November 17 's last victim was British defense attache Stephen Saunders , who was shot and killed at point-blank range on a busy Athens road in June 2000 .
0 1831696 1831660 The agency charged that one WD Energy worker discussed false reporting with traders at two other energy companies . The agency found further that a WD Energy employee discussed false reporting with traders at two other energy companies , which the CFTC didn 't identify .
1 1528383 1528083 Zulifquar Ali , a worshipper slightly wounded by shrapnel , said the assailants first targeted the mosque 's security guards . Witness Zulfiqar Ali , who was slightly wounded by shrapnel , said the attackers had focused on the mosque 's guards .
1 917965 918315 For the second year in a row , rises in hospital costs accounted for much of the inflation , accounting for 51 percent of the overall cost increase . For the second year in a row , rises in hospital costs dominated the increase , accounting for 51 percent of the overall cost spiral .
0 3218713 3218830 Q : Can I buy coverage for prescription drugs right away ? Congress has added a new benefit - an option to buy insurance coverage for prescription drugs .
1 221079 221003 The airline also said it has the option to buy 380 more airplanes , orders that would be split evenly between the two manufacturers . The airline has the option to buy 380 more , split evenly between the two manufacturers .
1 2546175 2546198 Dr Mark McClean , Jonathan 's family doctor , said if the drug had been administered earlier Jonathan would have retained more of his brain functions . Dr Mark McClean , the family 's GP , said had the drug been administered to Jonathan earlier , he would have retained more of his brain function .
0 799346 799268 The chain operates more than 3,400 stores , and has annual revenue of about $ 15.8 billion . The chain , which has been under new management since late 1999 , has more than 3,400 stores and $ 15.8 billion in annual revenue .
0 2673104 2673130 All patients developed some or all of the symptoms of E. coli food poisoning : bloody diarrhea , vomiting , abdominal cramping and nausea . Symptoms of the E. coli infection include bloody diarrhea , nausea , vomiting and abdominal cramping .
1 1354501 1354476 Federal regulators have turned from sour to sweet on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings Inc. and Dreyer 's Grand Ice Cream Inc . Federal regulators have changed their minds on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings and Dreyer 's Grand Ice Cream .
1 3070979 3070949 Environmental campaigners are using this weekend ’ s lunar eclipse to highlight the huge increase in light pollution across the UK . Environmental campaigners used the eclipse to highlight the surge in light pollution across Britain .
0 1264509 1264471 Available July 7 , the software supports the Solaris , IBM AIX , Red Hat Linux and Windows operating systems . The OpForce product currently works with Solaris , AIX , Red Hat Linux and Windows servers .
1 103280 103431 Justice Minister Martin Cauchon and Prime Minister Jean Chrétien have both said the Liberal government will introduce legislation soon to decriminalize possession of small amounts of pot for personal use . Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said the government will introduce legislation to decriminalize possession of small amounts of pot .
0 110731 110648 But Chauncey Billups demonstrated he 's also capable of big games , scoring 77 points over the final two games against the Magic . Billups scored 77 points in the final two games of the first-round series against the Magic .
1 2274844 2274714 Kelly killed himself after being exposed as the source for a BBC report which claimed the government had embellished evidence of Iraq 's banned weapons to justify the war . He killed himself after being exposed as the source for a BBC report which claimed the government exaggerated the case for war against Iraq .
0 1050307 1050144 And it 's going to be a wild ride , " said Allan Hoffenblum , a Republican consultant . Now the rest is just mechanical , " said Allan Hoffenblum , a Republican consultant .
1 2810634 2810670 While the Ibrahims had one separation operation , Goodrich and Dr. David Staffenberg plan about three for the Aguirres , with several weeks between each . Instead of one long operation to separate the twins , Goodrich and Dr. David Staffenberg plan about three , with several weeks between each .
1 3073773 3073779 Lay had contended that turning over the documents would violate his Fifth Amendment right against self-incrimination . Lay had refused to turn over the papers , asserting his Fifth Amendment right against self-incrimination .
0 261202 260995 The WHO experts didn 't say how many cases in Hebei were in rural areas . Hebei has reported 191 cases and eight deaths , though the WHO experts did not say how many were in rural areas .
1 1824224 1824209 Nearly 300 mutinous troops who seized a Manila shopping and apartment complex demanding the government resign gave up and retreated peacefully after some 19 hours . Mutinous troops who seized a Manila shopping and apartment complex demanding the government resign ended a 19-hour standoff late Sunday and returned to barracks without a shot fired .
1 548867 548785 In three years , Lend Lease has slipped from a top-five stock , when its share price was around $ 24 , to 37th . In the space of three years , Lend Lease has slipped from a top-five 5 stock when its share price hovered around $ 24 to 37th on the list .
0 2796658 2796682 About two hours later , his body , wrapped in a blanket , was found dumped a few blocks away . Then his body was dumped a few blocks away , found in a driveway on Argyle Road .
1 1808166 1808434 Columbia broke up over Texas upon re-entry on Feb. 1 . Columbia broke apart in the skies above Texas on Feb. 1 .
1 853475 853342 A year or two later , 259 , or 10 per cent , of the youths reported that they had started to smoke , or had taken just a few puffs . Within two years , 259 , or 10 percent , of the youths reported they had started to smoke or had at least taken a few puffs .
0 977772 977804 The Lord Chancellor was guardian of the Great Seal , used to stamp all official documents from the sovereign . Falconer will hold on , for now , to the Lord Chancellor 's Great Seal , used to sign off instructions from the sovereign .
1 577854 578500 Cindy Yeast , a 50-year-old Washington-area publicist , says she began taking supplements two years ago in part to avoid mild dementia that affects her elderly parents . She started taking supplements two years ago - partly to stave off mild dementia that affects her elderly parents .
1 2829194 2829229 The two are not related , but have referred to each other as father and son . He 's not related to Malvo , but the two have referred to each other as father and son .
1 2074182 2074668 Gibson said last month in a press statement that " neither I nor my film are anti-Semitic . Gibson said in a June statement that he and his film are not anti-Semitic .
0 2758265 2758282 The world 's largest software company said it recognized the difficulty the multiple patches posed for companies , and set out to make it easier for them to apply the updates . The world 's largest software company said it recognized the difficulty the multiple patches posed for companies trying to apply them .
1 1958079 1958143 The Dow Jones industrial average .DJI ended up 64.64 points , or 0.71 percent , at 9,191.09 , according to the latest available data . The blue-chip Dow Jones industrial average .DJI added 38 points , or 0.42 percent , to 9,165 .
1 544217 544325 The vote came just two days after Kurds swept City Council elections , taking the largest single block of votes on the 30-seat council . The vote for mayor followed City Council elections that gave Kurds the largest block of votes on the 30-seat council .
1 2385288 2385256 Large swells and dangerous surf already were being felt along sections of the coast . Already large swells and dangerous surf have arrived along the mid-Atlantic .
0 2324708 2325028 Based on a separate survey of households , the unemployment rate fell in August to 6.1 percent from 6.2 percent . Labor Department analysts discounted a slight improvement in the national unemployment rate , which fell in August to 6.1 percent from 6.2 percent .
1 2139506 2139427 " We will work with the board to ensure a smooth transition . " He said federal regulators would work with the corporation to ensure a " smooth transition . "
1 2965576 2965701 Gasps could be heard in the courtroom when the photo was displayed . Gasps could be heard as the photo was projected onto the screen .
1 2931098 2931144 Gilead had earnings of $ 73.1 million , or 33 cents a share , compared with $ 20.8 million , or 10 cents , in the year-ago quarter . Quarterly profit climbed to $ 73.1 million , or 33 cents a share , from $ 20.8 million , or 10 cents , a year earlier , the company said .
0 644788 644816 " I had one bad stretch of holes that put me out of contention to win , " Woods said . " I had one bad stretch of holes that put me out of contention , " Woods said , referring to his 42 on the front nine Saturday .
0 2551891 2551563 The poll had a margin of error of plus or minus 2 percentage points . It had a margin of sampling error of plus or minus four percentage points and was conducted Thursday through Saturday .
1 1089053 1089297 Sen. Patrick Leahy of Vermont , the committee 's senior Democrat , later said the problem is serious but called Hatch 's suggestion too drastic . Sen. Patrick Leahy , the committee 's senior Democrat , later said the problem is serious but called Hatch 's idea too drastic a remedy to be considered .
1 3435735 3435717 The broad Standard & Poor 's 500 < .SPX > eased 0.37 of a point , or 0.03 percent , at 1,121 . The Standard & Poor 's 500 Index < .SPX > slipped 0.26 point , or 0.02 percent , to 1,121.96 .
0 1954 2142 Watertown , Saugus and Framingham also are going smoke-free Monday , joining a growing number of cities around the country . Along with Boston , Watertown , Saugus and Framingham also are going smoke-free Monday .
1 3400796 3400822 That is evident from their failure , three times in a row , to get a big enough turnout to elect a president . Three times in a row , they failed to get a big _ enough turnout to elect a president .
1 1220668 1220801 We firmly believe we have an absolute right to use the common word ' spike ' as the name of our network . " We firmly believe that we have an absolute right to use the common word ' spike ' to name our network .
1 1889954 1889847 Sources who knew of the bidding said last week that cable TV company Comcast Corp. was also looking at VUE . Late last week , sources told Reuters cable TV company Comcast Corp. CMCSA.O also was looking at buying VUE assets .
1 315785 315653 But MTA officials appropriated the money to the 2003 and 2004 budgets without notifying riders or even the MTA board members considering the 50-cent hike , Hevesi found . MTA officials appropriated the surplus money to later years ' budgets without notifying riders or the MTA board members when the 50-cent hike was being considered , he said .
0 1521034 1520582 White , who had suffered kidney failure from years of high blood pressure , died at Cedars-Sinai Medical Center around 9 : 30 a.m. , said manager Ned Shankman . White , who had kidney failure from years of high blood pressure , had been undergoing dialysis and had been hospitalized since a September stroke .
1 2083598 2083810 About 10 percent of high school and 16 percent of elementary students must be proficient at math . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient .
1 1910610 1910455 The legal ruling follows three days of intense speculation Hewlett-Packard Co. may be bidding for the company . The legal ruling follows three days of wild volatility in RIM 's stock over speculation that PC giant Hewlett-Packard Co. may be bidding for the company .
1 3113791 3113782 The European Commission , the EU 's antitrust enforcer , is expected to issue its decision next spring — unless a settlement is reached . The European Commission is expected to issue its decision in the case next spring — unless a settlement is reached .
1 3214517 3214483 " So Sebastian did his best to convincingly confess to a crime that he didn 't commit in order to survive , " she told jurors . " Sebastian did his best to confess convincingly to a crime he didn 't do in order to survive , " Ms. Richardson declared .
0 2083612 2083810 Twenty percent of Latino students and 23 percent of black students performed at proficient or higher . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient .
1 661390 661218 He is charged in three bombings in Atlanta including a blast at the 1996 Olympics and one in Alabama . He is charged in three bombings in Atlanta - including a blast at the 1996 Olympics - along with the bombing in Alabama .
1 1269572 1269682 The men were remanded in custody and are due to appear again before court on July 8 . They were remanded in custody and will appear in court again on July 8 .
1 1095780 1095652 " No matter who becomes the sponsor for stock-car racing 's top series , NASCAR will need an all-star event , " Wheeler said in a statement . No matter who becomes the sponsor for stock-car racings top series , NASCAR will need an all-star event , Wheeler said Tuesday .
1 116294 116332 The Phillies were upset that Counsell had stolen second in the sixth inning with Arizona leading 7-1 . The Phillies were apparently upset when Counsell stole during the sixth with the Diamondbacks up 7-1 .
1 941617 941673 He said his hatred for such people grew from these discussions and had helped convince him violence was the answer . His hatred for these people had germinated from these discussions and helped cement his belief that violence was the panacea .
1 2640607 2640576 " There is no need for one deadline for all to create the ASEAN Economic Community , " Thaksin said . Thus , he said , there did not have to one deadline to create the economic community .
1 3310210 3310286 The announcement was made during the recording of a Christmas concert attended by top Vatican cardinals , bishops , and many elite from Italian society , witnesses said . The broadside came during the recording on Saturday night of a Christmas concert attended by top Vatican cardinals , bishops and many elite of Italian society , witnesses said .
1 3376093 3376101 The additional contribution brings total U.S. food aid to North Korea this year to 100,000 tonnes . The donation of 60,000 tons brings the total of U.S. contributions for the year to 100,000 .
1 1549586 1549609 Leon Williams ' body was found inside his third-floor apartment at 196 Bay St. , in Tompkinsville . The dead man , Leon Williams , was found in his third-floor apartment .
1 460211 460445 The player 's eyes were bloodshot and a blood-alcohol test produced a reading of 0.18 - well above Tennessee 's level of presumed intoxication of 0.10 , the report said . He failed a field sobriety test and a blood-alcohol test produced a reading of 0.18 – well above Tennessee 's level of presumed intoxication of 0.10 , the report said .
1 1196962 1197061 But Virgin wants to operate Concorde on routes to New York , Barbados and Dubai . Branson said that his preference would be to operate a fully commercial service on routes to New York , Barbados and Dubai .
0 862804 862715 He tried to fight off officers and was taken to a hospital after a police dog bit him but was later released . Cruz tried to fight off officers and was hospitalized after a police dog bit him , Sgt. Steve Dixon said .
1 1726935 1726879 The announcement , which economists said was not a surprise , may be bittersweet for the millions of Americans without jobs . Economists said the announcement was not a surprise , and politicians said it offered little comfort to the millions of Americans without jobs .
0 331980 332110 Asked if the delegates could leave on Friday , police intelligence chief in Aceh , Surya Dharma , told reporters they could not because they did not have proper permission . Asked if the delegates could leave on Friday , police intelligence chief Surya Dharma told reporters : " Of course they may not go .
1 173879 173832 Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid the yen 's rise against the dollar . Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid ever-falling domestic interest rates .
0 2834988 2835026 Iran has until the end of the month to satisfy the agency it has no plans for nuclear weapons . The Iranians have until the end of the month to answer all the agency 's questions about their past nuclear activities .
1 2587300 2587243 Her father , Florin Cioaba , the king of Transylvania 's Gypsies , had her brought back and she was married against her will . Her father , Roma King Florin Cioaba , had her brought back and she was promptly married against her will .
0 554905 554627 Claire had advanced to the third round of the 76th annual Scripps Howard National Spelling Bee . One by one they strolled to the microphone , all 251 youngsters in the 76th Scripps Howard National Spelling Bee .
1 1912524 1912648 Citigroup Inc . C.N , the world 's largest financial services company , on Wednesday promoted Marjorie Magner to chairman and chief executive of its global consumer group . Citigroup ( C ) on Wednesday named Marjorie Magner chairman and chief executive of its colossal global consumer business .
1 3255597 3255668 " They 've been in the stores for over six weeks , " says Carney . The quarterlies usually stay in stores for between six to eight weeks , " Carney added .
1 629316 629289 Let me just say this : the evidence that we have of weapons of mass destruction was evidence drawn up and accepted by the joint intelligence community . " The evidence that we had of weapons of mass destruction was drawn up and accepted by the Joint Intelligence Committee , " he said .
1 54181 53570 Ridge said no actual explosives or other harmful substances will be used . Ridge said no real explosives or harmful devices will be used in the exercise .
1 723557 724115 Thus far , Stewart 's company appears ready to stand behind her . For now , the company 's management appears to be standing behind Stewart .
0 2607718 2607708 But late Thursday night , the campaign issued a statement saying there would be no news conference and no big announcement . But late yesterday , the campaign and the state Democratic Party said there would be no news conference .
1 753858 753890 There 's also a flaw that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box .
1 587009 586969 Another $ 100-million in savings will come from management layoffs and pay cuts . The airline expects to save another $ 100-million a year through management layoffs and pay cuts .
1 308567 308525 He called on Prime Minister John Howard to establish a royal commission on child sex abuse . The Senate motion also called on Prime Minister John Howard to hold a royal commission into child sex abuse .
0 665419 665612 " We think that the United States of America should support the free speech of all groups , " Mr. White said , objecting to Mr. Olson 's recommendation . We think that the United States of America should support the free speech of all groups , he said .
1 2763517 2763576 Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler . The tube was removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for several years .
0 3107118 3107136 After 18 months , Nissen found that Lipitor stopped plaque buildup in the patients ' arteries . After 18 months , the atorvastatin patients had no change in the plaque in their arteries .
1 780604 780466 Toll , Australia 's second-largest transport company , last week offered NZ75 a share for Tranz Rail . Toll last week offered to buy the company for NZ75c a share , or $ NZ158 million .
0 1989213 1989116 " This child was literally neglected to death , " Armstrong County District Attorney Scott Andreassi said . Armstrong County District Attorney Scott Andreassi said the many family photos in the home did not include Kristen .
1 1462409 1462504 Wal-Mart , the nation 's largest private employer , has expanded its antidiscrimination policy to protect gay and lesbian employees , company officials said Tuesday . Wal-Mart Stores Inc . , the nation 's largest private employer , will now include gays and lesbians in its anti-discrimination policy , company officials said Wednesday .
1 260952 260924 Metro , bus and local rail services in France 's four largest towns -- Paris , Lyon , Lille and Marseille -- were severely disrupted , Europe 1 radio reported . Subway , bus and suburban rail services in France 's four largest cities -- Paris , Lyon , Lille and Marseille -- were severely disrupted , transport authorities said .
1 1224743 1225510 In the undergraduate case , Rehnquist said the use of race was not " narrowly tailored " to achieve the university 's asserted interest in diversity . Rehnquist wrote that the system was not narrowly tailored to achieve the interest in educational diversity .
0 3329379 3329416 SP2 is basically about security enhancements to Windows , such as the improved Internet Connection Firewall ( ICF ) . The firewall in the current Windows XP was known as the Internet Connection Firewall ( ICF ) .
1 2362761 2362698 A landslide in central Chungchong province derailed a Seoul-bound train and 28 passengers were injured , television said . In central Chungchong province , a landslide caused a Seoul-bound Saemaeul Express train to derail , injuring 28 people , local television said .
0 1465073 1464854 They will help draft a plan to attack obesity that Kraft will implement over three to four years . The team will help draft a plan by the end of the year to attack obesity .
1 195728 196099 But that amount would probably be impossible to pass in the Senate , where Republican moderates have refused to go above $ 350 billion . Such an amount would probably be unable to summon a majority of the Senate , where Republican moderates have refused to go above $ 350 billion .
1 2587767 2587673 In the clash with police , Lt. Mothana Ali said about 1,000 demonstrators had gone to the station demanding jobs . In Baghdad , police Lieut . Mothana Ali said about 1,000 demonstrators arrived at the station demanding jobs . 0 1490044 1489975 Corixa shares rose 54 cents to $ 7.74 yesterday on the Nasdaq Stock Market . Shares of Corixa rose 54 cents , or about 8 percent , to close at $ 7.74 . 1 958161 957782 Committee approval , expected today , would set the stage for debate on the Senate floor beginning Monday . That would clear the way for debate in the full Senate beginning on Monday . 1 1033204 1033365 O 'Brien was charged with leaving the scene of a fatal accident , a felony . Bishop Thomas O 'Brien , 67 , was booked on a charge of leaving the scene of a fatal accident . 0 2996241 2996734 Tom Hamilton said his daughter was conscious and alert and in stable condition after the attack Friday morning . Bethany , who remained in stable condition after the attack Friday morning , talked of the attack Saturday . 0 2015389 2015410 The Calgary woman , who is in her twenties , donated blood on Aug. 7 . The woman -- who has no symptoms of illness -- donated blood Aug. 7 . 1 221515 221509 Quattrone lawyer John W. Keker said his client is innocent . In a statement Monday , his lawyer John Keker said ``Frank Quattrone is innocent . 0 2283737 2283794 In the weeks leading up to the execution , several Florida officials received anonymous threatening letters . Several Florida officials connected to the case have received threatening letters , accompanied by rifle bullets . 1 2826681 2826474 The disagreement over online music sales was disclosed in documents filed last week with the judge and made available by the court yesterday . The fight over online music sales was disclosed in documents made available Monday by the court . 1 2249237 2249305 Parson was charged with intentionally causing and attempting to cause damage to protected computers . Parson is charged with one count of intentionally causing damage to a protected computer . 1 389239 389299 " The court and the public need to know much more of the details of the defendant 's seemingly massive fraud , " the judge said . " The court and the public need to know more of the defendants ' seemingly massive fraud , " he said . 1 2652187 2652218 The U.S. Supreme Court will hear arguments on Wednesday on whether companies can be sued under the Americans with Disabilities Act for refusing to rehire rehabilitated drug users . The high court will hear arguments today on whether companies can be sued under the ADA for refusing to rehire rehabilitated drug users . 1 2945693 2945847 The IRS said taxpayers can avoid undelivered checks by having refunds deposited directly into their checking or savings accounts . The IRS said taxpayers can avoid problems with lost or stolen refunds by having refunds deposited directly into personal checking or savings accounts . 1 2065523 2065836 " More than 70,000 men and women from bases in Southern California were deployed in Iraq . In all , more than 70,000 troops based in Southern California were deployed to Iraq . 1 2222998 2223097 BP shares slipped 0.8 percent to 433.50 pence ( $ 6.85 ) each in afternoon trading on the London Stock Exchange . BP shares slipped 48 cents to $ 41.72 Friday in trading on the New York Stock Exchange . 1 2561999 2561941 Because of the accounting charge , the company now says it lost $ 1.04 billion , or 32 cents a share , in the quarter ended June 30 . 
Including the charge , the Santa Clara , Calif.-based company said Monday it lost $ 1.04 billion , or 32 cents per share , in the period ending June 30 . 0 2324704 2325023 Friday 's report raised new worries that a weak job market could shackle the budding economic recovery despite a slight improvement in the overall unemployment rate . U.S. companies slashed payrolls for a seventh straight month in August , raising new worries that a weak jobs market could shackle the budding economic recovery . 1 2336453 2336545 Federal Emergency Management Administration designated $ 20 million to establish the registry . The registry was launched with $ 20 million from the Federal Emergency Management Agency . 1 720572 720486 BREAST cancer cases in the UK have hit an all-time high with more than 40,000 women diagnosed with the disease each year , Cancer Re-search UK revealed yesterday . Cases of breast cancer in Britain have reached a record high , with the number of women diagnosed with the disease passing the 40,000 mark for the first time . 1 1605818 1605806 " It was never our intention to sell the product , " said Health Minister Anne McClellan , a skeptic of medical marijuana use . " It was never the intention of us to sell product , " federal Health Minister Anne McLellan said yesterday in Edmonton . 0 2440680 2440474 GM , the world 's largest automaker , has 115,000 active UAW workers and another 340,000 retirees and spouses . They cover more than 300,000 UAW workers and 500,000 retirees and spouses . 0 726399 726078 Rosenthal is hereby sentenced to custody of the Federal Bureau of prisons for one day with credit for time served , " Breyer said to tumultuous cheers in the courtroom . " Rosenthal is hereby sentenced to custody of the Federal Bureau of Prisons for one day with credit for time served . " 1 533903 533818 " We are committed to helping the Iraqi people get on the path to a free society , " Rumsfeld said in a speech to the Council on Foreign Relations . " We are committed to helping the Iraqi people get on the path to a free society , " he said . 1 1166473 1166857 Mr. Young said he was disappointed that the government didn 't see the severe acute respiratory syndrome crisis as worthy of federal disaster-relief money . Young said he was disappointed the government didn 't see the SARS crisis as worthy of federal disaster relief money . 1 144089 143697 The 12-nation currency has risen by 33 percent against the dollar over the past 15 months . The euro is up 9 percent against the dollar in the past six weeks . 1 3439854 3439874 In February 2000 , the officers — Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy — were acquitted of all charges in the killing . The officers -- Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy -- were acquitted in 2000 of state murder charges . 1 3464314 3464302 I was surprised it turned out me talking and the president just listening . " I was surprised it turned out me talking and the president just listening . . . It was mostly a monologue . " 1 2008984 2009175 The state 's House delegation currently consists of 17 Democrats and 15 Republicans . Democrats hold a 17-15 edge in the state 's U.S. House delegation . 0 816867 816831 Freddie also said Leland C. Brendsel will retire as chairman and chief executive and resign from the board . He replaces Leland Brendsel , 61 , who retired as chairman and chief executive . 
1 192285 192327 We 'll be listening carefully to the [ IAEA ] director general 's report at the next board meeting . " We 'll be listening carefully to the ( IAEA ) director-general 's report at the next board meeting . " 1 2688145 2688162 In that position , Elias will report to Joe Tucci , president and CEO of EMC . As executive vice president of new ventures , Elias will report to Joe Tucci , EMC 's president and chief executive . 1 3294207 3294290 But with the PM due to leave tomorrow afternoon for personal reasons there was a risk he might not be present when the final decision was made . But with the Prime Minister due to leave tomorrow , a day early , he may not be present when the final decision is made . 0 205100 205145 A pro-independence radical , Miodrag Zivkovic , of the Liberal Alliance , came in second with 31 percent of the vote . Miodrag Zivkovic , of the Liberal Alliance of Montenegro , won 31 percent of the vote while the independent Dragan Hajdukovic got four percent . 0 3242051 3241897 Mr. Kerkorian tried unsuccessfully to take over Chrysler in 1995 , but did win representation on its board . Kerkorian and Tracinda had also tried to take over Chrysler in 1995 . 0 1076861 1077018 Glover spoke at a news conference that included about 20 relatives of the victims . About 20 family members of the victims were invited to the news conference . 1 2095803 2095786 Drax faced a financial crisis late last year after it lost its most lucrative sales contract , held with insolvent utility TXU Europe . Drax ’ s troubles began late last year when it lost its most lucrative sales contract , with the insolvent utility TXU Europe . 1 2112330 2112376 But I would rather be talking about high standards than low standards . " " I would rather be talking about positive numbers rather than negative . 1 3389318 3389271 It was not immediately known how many people were on flight UTA 141 , which could carry 141 passengers and crew . It was still not known exactly how many people were on the plane , which could carry 141 passengers and crew . 1 698948 698933 The market remains pinned in a narrow range after a powerful rally drove the broad Standard & Poor 's 500 index .SPX up more than 20 percent since mid-March . The market remains pinned in a narrow range after a powerful rally pushed the broad S & P 500 index up more than 20 percent since mid-March . 1 539585 539355 Witnesses said they believed the man planned to crash the Launceston-bound Qantas flight 1737 , which was carrying 47 passengers and six crew . Witnesses believe he wanted to crash Flight 1737 , which had 47 passengers and six crew . 1 684848 684557 As Samudra sat down to hear the indictment , he looked over to his nine lawyers and shouted ``God is Great ' ' three times . As he sat down to hear the indictment , Samudra looked over to his nine lawyers and shouted " Takbir ! " , or " Proclaim ! " , a religious rallying cry . 1 347017 347002 In hardest-hit Taipei , traffic has disappeared from once bustling streets , ubiquitous department stores stand mostly empty and restaurants are eerily quiet . In hardest-hit Taipei , traffic has disappeared from once-bustling streets and department stores and restaurants are virtually empty . 1 1592037 1592076 In a statement , Lee said he " no longer believes that Viacom deliberately intended to trade on my name when naming Spike TV . 
" Spike Lee no longer believes that Viacom deliberately intended to trade on his name by calling its own venture " Spike TV , " according to a statement read in court Tuesday . 0 3013483 3013540 Singapore Prime Minister Goh Chok Tong says China plays an important role in the integration of Asia , including managing the stresses and strains both within and between countries . HAINAN PROVINCE , China : Singapore Prime Minister Goh Chok Tong said China plays an important role in the integration of Asia . 1 2020252 2020081 The worm attacks Windows computers via a hole in the operating system , an issue Microsoft on July 16 had warned about . The worm attacks Windows computers via a hole in the operating system , which Microsoft warned of 16 July . 0 2614947 2614904 The premium edition adds OfficeFront Page 2003 , Acceleration Server 2000 , and SQL Server 2000 . The premium edition adds ISA Server , SQL Server and a specialized edition of BizTalk 2004 . 0 1744257 1744378 In the year-ago quarter , the steelmaker recorded a profit of $ 16.2 million , or 15 cents per share , on sales of $ 1.14 billion . In the second quarter last year , AK Steel reported a profit of $ 16.2 million , or 15 cents a share . 0 1119721 1119714 Sony claimed that the reader 's capacitance sensing technology cannot be fooled by paper copies and does not require cleaning . Its capacitance sensing technology electronically reads a fingerprint ; Sony says it can 't be fooled by paper copies and doesn 't require cleaning . 1 1186754 1187056 Amazon.com shipped out more than a million copies of the new book , making Saturday the largest distribution day of a single item in e-commerce history . Amazon.com shipped more than a million copies by Saturday afternoon , making Saturday the largest distribution day of a single item in e-commerce history . 1 2842562 2842582 The show 's closure affected third-quarter earnings per share by a penny . The company said this impacted earnings by a penny a share . 0 431076 431242 After the two-hour meeting on May 14 , publisher Arthur O. Sulzberger Jr . , executive editor Howell Raines and managing editor Gerald Boyd pledged quick remedies to staff grievances . The committee will make recommendations to Publisher Arthur Sulzberger , Executive Editor Howell Raines and Managing Editor Gerald Boyd . 1 1393764 1393984 It 's been a busy couple of days for security gurus assigned to keep their companies safe and sound . It 's been a busy couple of days for enterprise security gurus tasked with the job of keeping their companies safe and sound . 0 2916199 2916164 Lu reclined in a soft chair wearing a woolly coat near the blackened capsule . " It 's great to be back home , " said Lu , dressed in a woolly coat near the blackened capsule . 1 2530671 2530542 Gov. Bob Riley proposed the budget cuts after Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 . After Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 , Riley forecast significant cuts in state programs . 1 219064 218969 " It is probably not the easiest time to come in and take over the shuttle program , but then again , I look forward to the challenge , " he said . " It 's probably not the easiest time to come in and take over the shuttle program , but I look forward to the challenge , " Parsons told reporters at NASA headquarters . 0 2377289 2377259 Estonia 's place in the European mainstream and safeguard its independence regained in 1991 . 
Estonia was forcibly incorporated in the Soviet Union in 1940 and regained its independence only in 1991 . 0 2110220 2110199 Franklin County Judge-Executive Teresa Barton said a firefighter was struck by lightning and was taken to the Frankfort Regional Medical Center . A county firefighter , was struck by lightning and was in stable condition at Frankfort Regional Medical Center . 0 1864253 1863810 Police suspected that Shaichat , 20 , had been abducted either by Palestinians or by Israeli Arabs . Nobody claimed responsibility for Schaichat 's death , but police suspect that the 20-year-old soldier was abducted either by Palestinians or Israeli Arabs . 0 3150803 3150839 During this year 's August to October quarter , Lowe 's opened 38 new stores , including two relocations . During the third quarter , Lowe 's opened 38 new stores and now has 932 stores in 45 states . 0 969381 969512 The technology-laced Nasdaq Composite Index < .IXIC > declined 25.78 points , or 1.56 percent , to 1,627.84 . The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 . 1 271891 271839 Sony said the PSP would also feature a 4.5-inch LCD screen , Memory Stick expansion slots . It also features a 4.5 in back-lit LCD screen and memory expansion facilities . 0 2829648 2829613 Clinton did not mention that two Democratic senators , Charles Robb of Virginia and Wendell Ford of Kentucky , voted to shelve the McCain bill . Two Democrats , Sen. Charles Robb of Virginia and Wendell Ford of Kentucky , voted with the 40 Republicans . 1 886904 887158 Some of the company 's software developers will join Microsoft , but details haven 't been finalized , said Mike Nash , corporate vice president of Microsoft 's security business unit . Some of the companys software developers will join Microsoft , but details havent been finalized , said Mike Nash , corporate vice president of Microsofts security business unit . 0 2632692 2632767 Wal-Mart has said it plans to open at least 40 Supercenters in the state in the coming years ; analysts expect four or more to be in San Diego County . At least 40 of the outlets will be in California , and analysts expect four or more to be in San Diego County . 1 2240399 2240149 Cintas is battling efforts to unionize 17,000 of its workers and to let unions organize the workers by signing cards , rather than by a lengthy election process . Cintas is battling efforts to unionize 17,000 of its workers and labor 's demands to let its workers organize by signing cards , rather than by a lengthy election process . 1 805457 805985 The opposition would resort to rolling mass action " at strategic times of our choice and without warning to the dictatorship , " he said . " From now onwards we will embark on rolling mass action at strategic times of our choice and without any warning to the dictatorship , " he said . 1 2896308 2896334 Federal Agriculture Minister Warren Truss said the Government still did not know the real reason the sheep were rejected at the Saudi port of Jeddah on August 21 . He said the Government still did not know the real reason the original Saudi buyer pulled out on August 21 . 1 2110775 2110924 Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said that scenario is one among many that investigators are considering . 
Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said investigators are considering the scenario . 1 1762569 1762526 Hester said Sanmina was the best fit among several purchase offers the company received from electronics manufacturers and computer makers . Hester said Sanmina 's offer was the best among several Newisys received from electronics manufacturers and computer makers . 0 2706154 2706185 The other inmate fell but Selenski shimmed down the makeshift rope to a second-story roof and used the mattress to scale a razor-wire fence , Fischi said . After the other inmate fell , Selenski used the mattress to scale a 10-foot , razor-wire fence , Fischi said . 1 1057995 1057778 The hearing , expected to last a week , will determine whether Akbar faces a court-martial . The purpose of the hearing is to determine whether Akbar should be court-martialled . 1 1386884 1386857 He said he has begun a court action to seize Beacon Hill 's assets and has frozen more than $ 13 million Beacon Hill had when it closed . He said he has initiated a forfeiture action in court and frozen more than $ 13 million Beacon Hill had when it closed . 1 3093023 3092996 Speaking for the first time yesterday , Brigitte 's maternal aunt said his family was unaware he had was in prison or that he had remarried . Brigitte 's maternal aunt said his family was unaware he had been sent to prison , or that he had remarried in Sydney . 1 1661381 1661317 " Close co-operation between our law enforcement agencies , close co-operation between our intelligence services lie at the heart of the ongoing fight against terrorism . " Close cooperation between regional law enforcement agencies and intelligence services was at the heart of the fight against terrorism , he said . 0 2926039 2925982 The mother of a Briton held by Colombian guerrillasspoke of her relief yesterday after hearing that he might be freed in the next few weeks . The parents of a Briton being held hostage by Colombian rebels spoke yesterday of their optimism that he would be freed in time for his birthday next month . 0 637168 637447 We strongly disagree with Novell 's position and view it as a desperate measure to curry favor with the Linux community . McBride characterized Novell 's move as " a desperate measure to curry favor with the Linux community . " 1 696677 696932 After more than two years ' detention under the State Security Bureau , the four were found guilty of subversion in Beijing 's No. 1 Intermediate Court last Wednesday . After more than two years in detention by the State Security Bureau , the four were found guilty last Wednesday of subversion . 1 3122429 3122305 Mr Russell , 46 , a coal miner from Brisbane , said : " They are obviously hurting , so we are basically going over there to help them . " " They are obviously hurting so we are basically going over there to help them , " Russell , 46 , said . 1 1348909 1348954 The New York Democrat and former first lady has said she will not run for the White House in 2004 , but has not ruled out a race in later years . The former first lady has said she will not run for the White House in 2004 but has not ruled out a race later on . 0 162203 162101 It does not affect the current Windows Media Player 9.0 Series . Windows Media Player has had security problems before . 0 71501 71627 The seizure took place at 4 a.m. on March 18 , just hours before the first American air assault . The time was about 4 a.m. 
on March 18 , just hours before the first pinpoint missiles rained down on the capital . 1 2907762 2907649 Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations and large branches of the United Way by 15 percent and 28.6 percent , respectively . Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations by 15 percent and to large branches of the United Way by 28.6 percent . 1 2167771 2167744 In May , Mr. Hatfill said he was struck by a vehicle being driven by an FBI employee who was tailing him in Georgetown . Last May , Hatfill was struck by a vehicle being driven by an FBI employee who was tailing him in Washington 's Georgetown neighborhood . 1 3320577 3320553 " I will support a constitutional amendment which would honor marriage between a man and a woman , codify that , " he said . " If necessary , I will support a constitutional amendment which would honour marriage between a man and a woman , codify that . " 1 849291 849442 IBM of the US and Infineon Technologies of Germany will today announce a technological development that could threaten multi-billion dollar memory chip markets . IBMof the US andInfineon Technologies of Germany willon Tuesdayannounce a technological development that could threaten multi-billion dollar memory chip markets . 0 763948 763991 Costa 's semifinal opponent is Spaniard Juan Carlos Ferrero , whom he beat in last year 's final . Costa will play Juan Carlos Ferrero next in a rematch of last year 's final . 1 1908763 1908744 A former employee of a local power company pleaded guilty Wednesday to setting off a bomb that knocked out a power substation during the Winter Olympics last year . A former Utah Power meter reader pleaded guilty Wednesday to bombing a power substation during the 2002 Winter Olympics . 0 1876120 1876059 Thyroid hormones are known to help in weight loss by stimulating metabolism - and cutting cholesterol - but come with the unwanted side effect of speeding up the heartbeat . Thyroid hormones are known to help in weight loss by stimulating metabolism , and they can help cut cholesterol too . 1 518089 518133 Judge Craig Doran said it wasn 't his role to determine if Hovan was " an evil man " but maintained that " he has committed an evil act . " Judge Craig Doran said he couldn 't determine if Hovan was " an evil man " but said he " has committed an evil act . " 0 224932 224868 The Hartford shares rose $ 2.88 , or 6.6 percent , to close Monday at $ 46.50 on the New York Stock Exchange . Shares of Hartford rose $ 2.88 to $ 46.50 in New York Stock Exchange composite trading . 1 1771131 1771091 It also offers a built-in NAND flash boot loader so that high-density NAND flash memory can be used without having to install an additional support chip . The S3C2440 has a built-in NAND flash boot loader , for example , so that high-density NAND flash memory can be installed without an additional support chip . 0 2728425 2728251 It decided instead to issue them before the stock market opened Monday after the downgrade of its debt late Friday by Moody 's , the credit rating agency . It decided instead to issue them before the stock market opened Monday to counteract the downgrade of its debt late Friday by Moody 's to one step above junk status . 0 953733 953537 Altria shares fell 2.5 percent or $ 1.11 to $ 42.57 and were the Dow 's biggest percentage loser . 
Its shares fell $ 9.61 to $ 50.26 , ranking as the NYSE 's most-active issue and its biggest percentage loser . 1 349215 349241 It will be followed in November by a third movie , " The Matrix Revolutions . " The film is the second of a trilogy , which will wrap up in November with " The Matrix Revolutions . " 1 2919853 2919804 Massachusetts regulators and the Securities and Exchange Commission on Tuesday pressed securities fraud charges against Putnam Investments and two of its former portfolio managers for alleged improper mutual fund trading . State and federal securities regulators filed civil charges against Putnam Investments and two portfolio managers in the ever-expanding mutual fund trading scandal . 1 954526 954607 He is blocking them until the Air Force assigns four additional C-130 cargo planes to Gowen Field , an Idaho Air National Guard base in Boise . He is holding them up until the Air Force agrees to assign four additional C-130 cargo planes to the Idaho Air National Guard . 1 69773 69792 Cisco pared spending to compensate for sluggish sales . In response to sluggish sales , Cisco pared spending . 0 2823575 2823513 The study , published Monday in the journal Molecular Brain Research , is likely to also apply to humans , its authors said . The study , conducted on the brains of developing mice , was being published today in the journal Molecular Brain Research . 1 2455942 2455978 My decision today is not based on any one event . " Governor Rowland said his decision was " not based on any one event . " 1 131979 131957 Nelson , 27 , is being retried on civil-rights charges stemming from the disturbance which led to Rosenbaum 's death . Nelson , 27 , is being retried on civil rights charges stemming from the disturbance that led to Rosenbaum 's death . 0 2010705 2010779 " The government elements who have been causing trouble are still in place . The government elements who have been causing trouble are still in place , they are attacking us . " 1 54142 53641 Next Monday at about 2 p.m. ( CST ) , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms . Around the same time , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms . 1 1015249 1015204 Wal-Mart Stores Inc . , Kohl 's Corp. , Family Dollar Stores Inc. and Big Lots Inc. were among the merchants posting May sales that fell below Wall Street 's modest expectations . Wal- Mart , Kohl 's Corp. , Family Dollar Stores Inc . , and Big Lots Inc. posted May sales that fell below Wall Street 's modest expectations . 0 753928 753890 The patch also fixes a vulnerability that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box . 1 3022833 3023029 Peterson , a former fertilizer salesman , is charged with murder in the deaths of his 27-year-old wife and the baby boy she was carrying . Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son . 0 751520 751373 SPOT products run a Microsoft operating system and the company 's DirectBand radio technology developed with SCA Data Systems . The DirectBand network was developed with the assistance of SCA Data Systems . 0 218848 218851 He replaces Ron Dittemore , who announced his resignation in April . Dittemore announced his plans to resign on April 23 . 
1 3181118 3181443 Detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , of the arrest shortly after Perry was apprehended . Shortly after his arrest , detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , a medical assistant , about the development . 1 515581 515752 They were among about 40 people attending the traditional Jewish ceremony colored by some non-traditional touches . He said about 40 people attended the traditional Jewish ceremony colored by some nontraditional touches . 1 347022 347003 Taiwan had been relatively free of the viral infection until a fiasco at a Taipei hospital in late April caused the number of infections to skyrocket . Taiwan had been relatively free of the viral infection until a severe outbreak at a Taipei hospital in late April . 1 3311600 3311633 Mr. Rowland attended a party in South Windsor for the families of Connecticut National Guard soldiers called to active duty . Rowland was making an appearance at a holiday party for families of Connecticut National Guard soldiers assigned to duty in Iraq and Afghanistan . 0 3439114 3439084 Ross Garber , Rowland 's lawyer , said Tuesday he would attend the meeting and would ask to speak on the issue . Ross Garber , Rowland 's legal counsel , said the governor would have no comment on the condo deal . 0 487951 488007 The euro was at 1.5281 versus the Swiss franc EURCHF = , up 0.2 percent on the session , after hitting its highest since mid-2001 around 1.5292 earlier in the session . The euro was steady versus the Swiss franc after hitting its highest since mid-2001 of 1.5261 earlier in the session . 0 314997 315030 On the stand Wednesday , she said she was referring only to the kissing . On the stand Wednesday , she testified that she was referring to the kissing before the alleged rape . 0 4733 4557 Garner said the group would probably be expanded to include , for example , a Christian and perhaps another Sunni leader . The group has already met several times and Gen. Garner said it probably will be expanded to include a Christian and perhaps another Sunni Muslim leader . 1 2820371 2820525 Blair 's Foreign Secretary Jack Straw was to take his place on Monday to give a statement to parliament on the European Union . Blair 's office said his Foreign Secretary Jack Straw would take his place on Monday to give a statement to parliament on the EU meeting the prime minister attended last week . 1 801552 801516 " There were more people surrounding the clubhouse than the Unabomber 's house up in the hills , " Baker said . " There are more people surrounding the clubhouse than surrounded the Unabomber 's home in the hills . 1 1704987 1705268 Charles O. Prince , 53 , was named as Mr. Weill 's successor . Mr. Weill 's longtime confidant , Charles O. Prince , 53 , was named as his successor . 1 396041 396188 Officials are also meeting with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . Canadian officials were also expected to meet yesterday with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . 0 1014983 1014963 GE stock closed Friday at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange . 
1 2320654 2320666 The Midwestern research center will focus on the development of diagnostic , therapeutic and vaccine products for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . The Midwestern center will focus on diagnosis , treatment and vaccines for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . 1 1057876 1057778 The hearing is to determine whether there is enough evidence to order Akbar to a general court-martial proceeding . The purpose of the hearing is to determine whether Akbar should be court-martialled . 0 2116843 2116883 In the United States , heart attacks kill about 460,000 year , in Canada about 80,000 . In the United States , heart attacks kill about 460,000 yearly , according to the National Institutes of Health . 1 1461629 1461781 Ninety-five percent of international cargo to the United States is carried by ship . Ships carry 95 percent of international cargo to the United States . 0 374015 374162 " It 's a major victory for Maine , and it 's a major victory for other states . The Maine program could be a model for other states . 1 2493369 2493428 News that oil producers were lowering their output starting in November exacerbated a sell-off that was already under way on Wall Street . News that the Organization of Petroleum Exporting Countries was lowering output starting in November exacerbated a stock sell-off already under way yesterday . 1 490355 490378 They note that after several weeks of rallies on upbeat earnings , investors are looking for stronger evidence of a recovery before sending stocks higher . After several weeks of market rallies on upbeat earnings , many investors are looking for more concrete signs of an economic recovery . 1 2691044 2691264 Most economists had expected a more dire report , with many anticipating the fifth month of job losses in six months . Most economists had been expecting a far more dire report , with many expecting to see the fifth month of job losses in six months in September . 1 1831453 1831491 But software license revenues , a measure financial analysts watch closely , decreased 21 percent to $ 107.6 million . License sales , a key measure of demand , fell 21 percent to $ 107.6 million . 1 2380695 2380822 King , brand-name writer , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters . Stephen King , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters from the National Book Foundation . 1 2577517 2577531 The Denver-based natural gas producer and marketer said the inaccurate reporting was discovered after it received a subpoena from the U.S. Commodity Futures Trading Commission . The natural gas producer and marketer said the inaccurate reporting was discovered in response to a subpoena from the U.S. Commodity Futures Trading Commission , or CFTC . 1 3267026 3266930 The steel tariffs , which the U.S. president imposed in March 2002 , will officially end at midnight , instead of March 2005 as initially planned . The U.S. steel tariffs , which Bush imposed in March 2002 , were to officially end at midnight Thursday ( 0500 GMT ) , instead of March 2005 as initially planned . 1 360875 360943 Business Week 's online edition reported on Friday that WorldCom and the SEC could announce a settlement as early as Monday . BusinessWeek Online has learned that the settlement could come as early as Monday , May 19 . 
1 162632 162653 Only one of the five buildings in the Baghdad compound of the United Nations Development Program escaped being burned , the UN said on its Web site . Only one of the five buildings in the compound in Baghdad run by the UN Development Program , escaped being burned , the UN said on its Web site . 1 1128884 1128865 Shares of Salix have rocketed 64 percent since Axcan made its first offer on April 10 . Since the initial takeover offer , Salix shares have risen about 35 percent . 1 3264732 3264648 The jury verdict , reached Wednesday after less than four hours of deliberation , followed a 2 week trial , during which Waagner represented himself . The quick conviction followed a 2 1 / 2 week trial , during which the Venango County man represented himself . 1 1721433 1721267 It 's happened five times in the last 11 years : A disaster puts this Southwestern town in the headlines during the summer tourist season . It 's happened five times in the last decade : A disaster puts this tourist town in the headlines during summer , its busiest season . 0 146112 146127 The broader Standard & Poor 's 500 Index .SPX edged down 9 points , or 0.98 percent , to 921 . The technology-laced Nasdaq Composite Index < .IXIC > shed 15 points , or 0.98 percent , to 1,492 . 1 389117 389052 The company emphasized that McDonald 's USA does not import any raw beef or hamburger patties from Canada for McDonald 's use in the United States . McDonald 's said in a statement that it does not import any raw beef or hamburger patties from Canada for use in the United States . 1 872784 872834 Gregory Parseghian , a former investment banker , was appointed chief executive . Greg Parseghian was appointed the new chief executive . 0 2977500 2977547 Their contract will expire at 12 : 01 a.m. Wednesday instead of 12 : 01 a.m. Sunday , said Rian Wathen , organizing director for United Food and Commercial Workers Local 700 . " It has outraged the membership , " said Rian Wathen , organizing director of United Food and Commercial Workers Local 700 . 1 3107137 3107119 But plaque volume increased by 2.7 percent in pravastatin patients . The volume of plaque in Pravachol patients ' arteries rose by 3 % . 1 1619244 1619274 Today in the US , the book - kept under wraps by its publishers , G. P. Putnam 's Sons , since its inception - will appear in bookstores . Tomorrow the book , kept under wraps by G. P. Putnam 's Sons since its inception , will appear in bookstores . 0 3061836 3062031 The S & P / TSX composite rose 87.74 points on the week , while the TSX Venture Exchange composite gained 44.49 points . On the week , the Dow Jones industrial average rose 11.56 points , while the Nasdaq Stock Market gained 39.42 points . 1 485999 486011 Ex-KGB agent Putin added that the Beatles were considered ' propaganda of an alien ideology ' . In Soviet times the Beatles ' music " was considered propaganda of an alien ideology . 
================================================
FILE: src/examples/pytorch/bert_tutorial/parallel.py
================================================
from concurrent import futures
import torch
import torch.neuron
import os
from time import time
from queue import Queue
import warnings


def consumer(model, input_queue):
    while True:
        inputs, input_id, callback_fn = input_queue.get()
        try:
            # Stop execution if the stop sentinel is received
            if inputs == "stop":
                break

            start = time()
            results = model(*inputs)

            # Make the output iterable - if it is not already a tuple or list
            if not isinstance(results, (tuple, list)):
                results = [results]
            end = time()

            if callback_fn is not None:
                callback_fn(results, input_id, start, end)
        finally:
            # Account for every item pulled off the queue so Queue.join() works
            input_queue.task_done()


class NeuronSimpleDataParallel():

    def __init__(self, model_file, num_neuron_cores, batch_size=1):
        self.num_neuron_cores = num_neuron_cores
        self.batch_size = batch_size

        os.environ['NEURON_RT_NUM_CORES'] = str(num_neuron_cores)

        # Construct a list of models - the runtime places one copy per NeuronCore
        self.models = [torch.jit.load(model_file)
                       for i in range(num_neuron_cores)]

        # Create a shared input queue; the bound keeps producers from racing
        # too far ahead of the consumers
        self.input_queue = Queue(maxsize=num_neuron_cores * 16)

        self.executor = futures.ThreadPoolExecutor(
            max_workers=num_neuron_cores)

    def eval(self):
        for model in self.models:
            model.eval()

    def train(self):
        for model in self.models:
            model.train()

    def start_continuous_inference(self):
        # One consumer thread per model, all draining the same queue
        for model in self.models:
            self.executor.submit(consumer, model, self.input_queue)

    def infer(self, batch, input_id, callback_fn):
        self.input_queue.put((batch, input_id, callback_fn))

    def stop(self):
        # Enqueue one stop sentinel per worker so every consumer exits, then
        # wait until all queued work (including the sentinels) is processed
        for _ in range(self.num_neuron_cores):
            self.input_queue.put(("stop", -1, None))
        self.input_queue.join()
        self.executor.shutdown()


================================================
FILE: src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compiling and Deploying HuggingFace Pretrained BERT\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "In this tutorial we will compile and deploy the BERT-base version of HuggingFace 🤗 Transformers BERT for Inferentia. The full list of HuggingFace's pretrained BERT models can be found in the BERT section of this page: https://huggingface.co/transformers/pretrained_models.html. \n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge or larger instance. Only the compilation step of this tutorial requires an inf1.6xlarge, not the inference itself. For simplicity we will run the whole tutorial on an inf1.6xlarge, but in a real-life scenario the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `neuron-cc[tensorflow]`\n", "- `transformers`\n", "\n", "Most of these packages will be installed when configuring your environment using the Neuron PyTorch setup guide. The additional dependencies must be installed here."
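, "\n", "\n", "As a quick sanity check (not part of the original setup steps), you can confirm that this kernel can see the Neuron packages before continuing:\n", "\n", "```python\n", "import torch\n", "import torch.neuron  # raises ImportError if torch-neuron is missing from this kernel\n", "print(torch.__version__)\n", "```"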
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Suppresses tokenizer warnings, making errors easier to detect\n", "%env TOKENIZERS_PARALLELISM=True\n", "!pip install --upgrade \"transformers==4.6.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile the model into an AWS Neuron optimized TorchScript\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow # to work around a protobuf version conflict issue\n", "import torch\n", "import torch.neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", "import transformers\n", "import os\n", "import warnings\n", "\n", "# Setting up NeuronCore groups for inf1.6xlarge with 16 cores\n", "num_cores = 16 # This value should be 4 on inf1.xlarge and inf1.2xlarge\n", "os.environ['NEURON_RT_NUM_CORES'] = str(num_cores)\n", "\n", "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)\n", "\n", "# Set up some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "max_length=128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "\n", "# Run the original PyTorch model on the compilation example\n", "paraphrase_classification_logits = model(**paraphrase)[0]\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']\n", "\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "\n", "# Save the TorchScript for later use\n", "model_neuron.save('bert_neuron.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may inspect `model_neuron.graph` to see which parts run on CPU and which run on the accelerator. All native `aten` operators in the graph will run on CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model_neuron.graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Deploy the AWS Neuron optimized TorchScript\n", "\n", "To deploy the AWS Neuron optimized TorchScript, you may choose to load the saved TorchScript from disk and skip the slow compilation."
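, "\n", "\n", "If you load the saved model in a fresh process (for example, on a separate deployment instance), import `torch.neuron` first so the Neuron operators are registered before deserialization. A minimal sketch, assuming `bert_neuron.pt` has been copied to the working directory:\n", "\n", "```python\n", "import torch\n", "import torch.neuron  # registers the Neuron operators used by the compiled graph\n", "\n", "model_neuron = torch.jit.load('bert_neuron.pt')  # loads the compiled artifact, no recompilation\n", "```"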
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load TorchScript back\n", "model_neuron = torch.jit.load('bert_neuron.pt')\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "classes = ['not paraphrase', 'paraphrase']\n", "paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run the model in parallel across multiple NeuronCores. First we define two helper functions: one to pad incomplete batches and one to count correct predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_input_with_padding(batch, batch_size, max_length):\n", " ## Reformulate the batch into three batch tensors - default batch size batches the outer dimension\n", " encoded = batch['encoded']\n", " inputs = torch.squeeze(encoded['input_ids'], 1)\n", " attention = torch.squeeze(encoded['attention_mask'], 1)\n", " token_type = torch.squeeze(encoded['token_type_ids'], 1)\n", " quality = list(map(int, batch['quality']))\n", "\n", " if inputs.size()[0] != batch_size:\n", " print(\"Input size = {} - padding\".format(inputs.size()))\n", " remainder = batch_size - inputs.size()[0]\n", " zeros = torch.zeros( [remainder, max_length], dtype=torch.long )\n", " inputs = torch.cat( [inputs, zeros] )\n", " attention = torch.cat( [attention, zeros] )\n", " token_type = torch.cat( [token_type, zeros] )\n", "\n", " assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)\n", " assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)\n", " assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)\n", "\n", " return (inputs, attention, token_type), quality\n", "\n", "def count(output, quality):\n", " assert output.size(0) >= len(quality)\n", " correct_count = 0\n", " count = len(quality)\n", " \n", " batch_predictions = [ row.argmax().item() for row in output ]\n", "\n", " for a, b in zip(batch_predictions, quality):\n", " if int(a) == int(b):\n", " correct_count += 1\n", "\n", " return correct_count, count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data parallel inference\n", "In the cell below, we use the data parallel approach for inference. In this approach, we load multiple models, all of them running in parallel. Each model is loaded onto a single NeuronCore. In the implementation below, we launch 16 models, thereby utilizing all 16 cores on an inf1.6xlarge.\n", "\n", "> Note: If you decrease `num_cores` in the cells above, restart the notebook and run the `!sudo rmmod neuron; sudo modprobe neuron` step in cell 2 to clear the Neuron cores.\n", "\n", "Since we can run more than one model concurrently, the throughput of the system goes up. To achieve the maximum gain in throughput, we need to feed the models efficiently so as to keep them busy at all times. In the setup below, this is done by using a producer-consumer model. We maintain a common Python queue shared across all the models.
The common queue enables feeding data continuously to the models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "\n", "max_length = 128\n", "num_cores = 16\n", "batch_size = 1\n", "\n", "tsv_file=\"glue_mrpc_dev.tsv\"\n", "\n", "data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)\n", "\n", "# Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size, num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", " correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", " elapsed = end - start\n", " results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron.pt', num_cores)\n", "\n", "# Start the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Warm up the cores\n", "z = torch.zeros( [batch_size, max_length], dtype=torch.long )\n", "batch = (z, z, z)\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.infer(batch, -1, None)\n", " \n", "input_dict = {}\n", "input_id = 0\n", "for _ in range(30):\n", " for batch in data_loader:\n", " batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", " input_dict[input_id] = quality\n", " callback_fn = functools.partial(result_handler, input_dict=input_dict)\n", " parallel_neuron_model.infer(batch, input_id, callback_fn)\n", " input_id += 1\n", "\n", "# Stop inference\n", "parallel_neuron_model.stop()\n", "\n", "\n", "with open(\"benchmark.txt\", \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", " for line in f:\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now recompile the model with a larger batch size of six sentence pairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "batch_size = 6\n", "\n", "example_inputs_paraphrase = (\n", " torch.cat([paraphrase['input_ids']] * batch_size, 0), \n", " torch.cat([paraphrase['attention_mask']] * batch_size, 0), \n", " torch.cat([paraphrase['token_type_ids']] * batch_size, 0)\n", ")\n", "\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron_batch = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "## Save the batched model\n", "model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rerun inference with batch size 6." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "\n", "max_length = 128\n", "num_cores = 16\n", "batch_size = 6\n", "\n", "data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)\n", "\n", "# Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size,
num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", " correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", " elapsed = end - start\n", " results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron_b{}.pt'.format(batch_size), num_cores)\n", "\n", "# Start the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Add to the input queue to warm up all cores\n", "z = torch.zeros( [batch_size, max_length], dtype=torch.long )\n", "batch = (z, z, z)\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.infer(batch, -1, None)\n", "\n", "input_dict = {}\n", "input_id = 0\n", "for _ in range(30):\n", " for batch in data_loader:\n", " batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", " input_dict[input_id] = quality\n", " callback_fn = functools.partial(result_handler, input_dict=input_dict)\n", " parallel_neuron_model.infer(batch, input_id, callback_fn)\n", " input_id += 1\n", "\n", "# Stop inference\n", "parallel_neuron_model.stop()\n", "\n", "with open(\"benchmark_b{}.txt\".format(batch_size), \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark_b{}.txt\".format(batch_size), \"r\") as f:\n", " for line in f:\n", " print(line)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }

================================================
FILE: src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert_shared_weights.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Parallel HuggingFace Pretrained BERT with Weight Sharing (Deduplication)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "In this tutorial we will compile and deploy the BERT-base version of HuggingFace 🤗 Transformers BERT for Inferentia, with an additional demonstration of the Weight Sharing (Deduplication) feature.\n", "\n", "To use the [Weight Sharing (Deduplication) feature](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#shared-weights-neuron-rt-multi-instance-shared-weights), you must set the Neuron Runtime environment variable NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS to \"TRUE\" together with the [core placement API](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuron/api-core-placement.html) (``torch_neuron.experimental.neuron_cores_context()``).\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge or larger instance. Only the compilation step of this tutorial requires an inf1.6xlarge, not the inference itself.
For simplicity we run this tutorial on an inf1.6xlarge, but in a real-life scenario the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `neuron-cc[tensorflow]`\n", "- `transformers`\n", "\n", "Most of these packages will be installed when configuring your environment using the Neuron PyTorch setup guide. The additional dependencies must be installed here." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings making errors easier to detect\n", "!pip install --upgrade \"transformers==4.6.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile the model into an AWS Neuron optimized TorchScript\n", "\n", "This step compiles the model into an AWS Neuron optimized TorchScript, and saves it in the file ``bert_neuron.pt``. This step is the same as in the pretrained BERT tutorial without the Shared Weights feature. We use batch 1 for simplicity." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import tensorflow # to workaround a protobuf version conflict issue\n", "import torch\n", "import torch.neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", "import transformers\n", "import os\n", "import warnings\n", "\n", "\n", "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)\n", "\n", "# Setup some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "max_length=128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "\n", "# Run the original PyTorch model on the compilation example\n", "paraphrase_classification_logits = model(**paraphrase)[0]\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']\n", "\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", 
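"# Illustrative addition (not in the original tutorial): the Neuron logits should closely\n", "# match the CPU logits computed above; small numerical differences from compilation are expected\n", "print('Max logit difference:', (paraphrase_classification_logits - paraphrase_classification_logits_neuron[0]).abs().max().item())\n", 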
"not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "\n", "# Save the TorchScript for later use\n", "model_neuron.save('bert_neuron.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Deploy the AWS Neuron optimized TorchScript\n", "\n", "To deploy the AWS Neuron optimized TorchScript, you may choose to load the saved TorchScript from disk and skip the slow compilation. This step is the same as the pretrained BERT tutorial without Shared Weights feature" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load TorchScript back\n", "model_neuron = torch.jit.load('bert_neuron.pt')\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "classes = ['not paraphrase', 'paraphrase']\n", "paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define two helper functions to pad input and to count correct results." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_input_with_padding(batch, batch_size, max_length):\n", " ## Reformulate the batch into three batch tensors - default batch size batches the outer dimension\n", " encoded = batch['encoded']\n", " inputs = torch.squeeze(encoded['input_ids'], 1)\n", " attention = torch.squeeze(encoded['attention_mask'], 1)\n", " token_type = torch.squeeze(encoded['token_type_ids'], 1)\n", " quality = list(map(int, batch['quality']))\n", "\n", " if inputs.size()[0] != batch_size:\n", " print(\"Input size = {} - padding\".format(inputs.size()))\n", " remainder = batch_size - inputs.size()[0]\n", " zeros = torch.zeros( [remainder, max_length], dtype=torch.long )\n", " inputs = torch.cat( [inputs, zeros] )\n", " attention = torch.cat( [attention, zeros] )\n", " token_type = torch.cat( [token_type, zeros] )\n", "\n", " assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)\n", " assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)\n", " assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)\n", "\n", " return (inputs, attention, token_type), quality\n", "\n", "def count(output, quality):\n", " assert output.size(0) >= len(quality)\n", " correct_count = 0\n", " count = len(quality)\n", " \n", " batch_predictions = [ row.argmax().item() for row in output ]\n", "\n", " for a, b in zip(batch_predictions, quality):\n", " if int(a)==int(b):\n", " correct_count += 1\n", "\n", " return correct_count, count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data parallel inference\n", "In the below cell, we use the data parallel approach for inference. In this approach, we load multiple models, all of them running in parallel. Each model is loaded onto a single NeuronCore via the core placement API (``torch_neuron.experimental.neuron_cores_context()``). 
We also set Neuron Runtime environment variable ``NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS`` to \"TRUE\" as required to use the Weight Sharing feature.\n", "\n", "In the below implementation, we launch 16 models, thereby utilizing all 16 cores on an inf1.6xlarge.\n", "\n", "> Note: If you decrease ``num_cores`` in the cells below, restart the notebook and run the `!sudo rmmod neuron; sudo modprobe neuron` step in cell 2 to clear the Neuron cores.\n", "\n", "Since we can run more than one model concurrently, the overall system throughput goes up. To achieve the maximum gain in throughput, we need to feed the models efficiently so as to keep them busy at all times. In the below setup, we use parallel threads to feed data continuously to the models.\n", "\n", "When running the cell below, you can monitor the Inferentia device activities by running ``neuron-top`` in another terminal. You will see that \"Device Used Memory\" is 1.6GB total, and the model instance loaded onto NeuronDevice 0 NeuronCore 0 uses the most device memory (272MB) while the other model instances loaded onto other NeuronCores use less device memory (92MB). This shows the effect of using Shared Weights as the device memory usage is lower. If you change ``NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS`` to \"FALSE\" you will see that \"Device Used Memory\" is 3.2GB, and the model instances loaded onto NeuronDevice 0 NeuronCore 0 and 1 use the most device memory (360MB) while the other model instances now use 180MB each." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "import os\n", "import torch.neuron as torch_neuron\n", "from concurrent import futures\n", "\n", "# Setting up NeuronCore groups for inf1.6xlarge with 16 cores\n", "num_cores = 16 # This value should be 4 on inf1.xlarge and inf1.2xlarge\n", "os.environ['NEURON_RT_NUM_CORES'] = str(num_cores)\n", "os.environ['NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS'] = 'TRUE'\n", "#os.environ['NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS'] = 'FALSE'\n", "\n", "max_length = 128\n", "batch_size = 1\n", "\n", "tsv_file=\"glue_mrpc_dev.tsv\"\n", "\n", "data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)\n", "\n", "#Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size, num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", " correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", " elapsed = end - start\n", " results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "with torch_neuron.experimental.neuron_cores_context(start_nc=0, nc_count=num_cores):\n", " model = torch.jit.load('bert_neuron.pt')\n", "\n", "# Warm up the cores\n", "z = torch.zeros( [batch_size, max_length], dtype=torch.long )\n", "batch = (z, z, z)\n", "for _ in range(num_cores*4):\n", " model(*batch)\n", "\n", "# Prepare the input data\n", "batch_list = []\n", "for batch in data_loader:\n", " batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", " batch_list.append((batch, quality))\n", "\n", "# One thread running a model on one core\n", "def one_thread(feed_data, quality):\n", " start = time.time()\n", " result = model(*feed_data)\n", " end = time.time() \n", " return 
result[0], quality, start, end\n", "\n", "# Launch more threads than models/cores to keep them busy\n", "processes = []\n", "with futures.ThreadPoolExecutor(max_workers=num_cores*2) as executor:\n", " # extra loops to help you see activities in neuron-top\n", " for _ in range(10):\n", " for input_id, (batch, quality) in enumerate(batch_list):\n", " processes.append(executor.submit(one_thread, batch, quality))\n", "\n", "results = BertResults(batch_size, num_cores)\n", "for _ in futures.as_completed(processes): \n", " (output, quality, start, end) = _.result() \n", " correct_count, inference_count = count(output, quality)\n", " results.add_result(correct_count, inference_count, [start - end], [start], [end])\n", "\n", "with open(\"benchmark.txt\", \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", " for line in f:\n", " print(line)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python (torch-neuron)", "language": "python", "name": "aws_neuron_venv_pytorch_inf1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/pytorch/byoc_sm_bert_tutorial/code/inference.py ================================================ import os import json import tensorflow # to workaround a protobuf version conflict issue import torch import torch.neuron from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig JSON_CONTENT_TYPE = 'application/json' def model_fn(model_dir): tokenizer_init = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc") model_file =os.path.join(model_dir, 'neuron_compiled_model.pt') model_neuron = torch.jit.load(model_file) # print("using {}".format(model_file)) return (model_neuron, tokenizer_init) def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE): if content_type == JSON_CONTENT_TYPE: input_data = json.loads(serialized_input_data) # print(input_data) return input_data else: raise Exception('Requested unsupported ContentType in Accept: ' + content_type) return def predict_fn(input_data, models): # print('Got input Data: {}'.format(input_data)) model_bert, tokenizer = models sequence_0 = input_data[0] sequence_1 = input_data[1] max_length=128 paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") # Convert example inputs to a format that is compatible with TorchScript tracing example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids'] # Verify the TorchScript works on example inputs paraphrase_classification_logits_neuron = model_bert(*example_inputs_paraphrase) classes = ['not paraphrase', 'paraphrase'] paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item() out_str = 'BERT says that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[paraphrase_prediction]) return out_str def output_fn(prediction_output, accept=JSON_CONTENT_TYPE): if accept == JSON_CONTENT_TYPE: return json.dumps(prediction_output), accept 
raise Exception('Requested unsupported ContentType in Accept: ' + accept) ================================================ FILE: src/examples/pytorch/byoc_sm_bert_tutorial/container/Dockerfile ================================================ FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04 # Install packages RUN pip install "transformers==4.7.0" # CMD ["/usr/local/bin/entrypoint.sh"] ================================================ FILE: src/examples/pytorch/byoc_sm_bert_tutorial/sagemaker_container_neuron.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "4674f667", "metadata": {}, "source": [ "# Deploy a pretrained PyTorch BERT model from HuggingFace on Amazon SageMaker with Neuron container" ] }, { "cell_type": "markdown", "id": "b3e39838", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "id": "a92c454f", "metadata": {}, "source": [ "In this tutorial we will deploy a pretrained BERT Base model from HuggingFace Transformers on SageMaker, using the [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers). We will use the same model as shown in the [Neuron Tutorial \"PyTorch - HuggingFace Pretrained BERT Tutorial\"](../../../../frameworks/torch/torch-neuronx/tutorials/training/bert.html#). We will compile the model and build a custom AWS Deep Learning Container, to include the HuggingFace Transformers Library. \n", "\n", "This Jupyter Notebook should run on a ml.c5.4xlarge SageMaker Notebook instance. You can set up your SageMaker Notebook instance by following the [Get Started with Amazon SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html) documentation. \n", "\n", "> We recommend increasing the size of the base root volume of your SageMaker notebook instance, to accommodate the models and containers built locally. A root volume of 10 GB should suffice. 
\n" ] }, { "cell_type": "markdown", "id": "37445ad2", "metadata": {}, "source": [ "## Install Dependencies:" ] }, { "cell_type": "markdown", "id": "3ecd765f", "metadata": {}, "source": [ "This tutorial requires the following pip packages:" ] }, { "cell_type": "markdown", "id": "cae3092c", "metadata": {}, "source": [ "- torch-neuron\n", "- neuron-cc[tensorflow]\n", "- transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "066c3731", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "!pip install --upgrade --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com\n", "!pip install --upgrade --no-cache-dir 'transformers==4.6.0'" ] }, { "cell_type": "markdown", "id": "a4796d3a", "metadata": {}, "source": [ "## Compile the model into an AWS Neuron optimized TorchScript" ] }, { "cell_type": "code", "execution_count": null, "id": "6fe85f8e", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig" ] }, { "cell_type": "code", "execution_count": null, "id": "0c5c253a", "metadata": {}, "outputs": [], "source": [ "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)\n", "\n", "# Setup some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "max_length=128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "\n", "# Run the original PyTorch model on compilation exaple\n", "paraphrase_classification_logits = model(**paraphrase)[0]\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']" ] }, { "cell_type": "code", "execution_count": null, "id": "44255ada", "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "# This step may need 3-5 min\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts')" ] }, { "cell_type": "markdown", "id": "5c4752ac", "metadata": {}, "source": [ "You may inspect **model_neuron.graph** to see which part is running on CPU versus running on the accelerator. All native **aten** operators in the graph will be running on CPU." 
] }, { "cell_type": "code", "execution_count": null, "id": "dc00889e", "metadata": {}, "outputs": [], "source": [ "# See which part is running on CPU versus running on the accelerator.\n", "print(model_neuron.graph)" ] }, { "cell_type": "markdown", "id": "775fb30d", "metadata": {}, "source": [ "Save the compiled model, so it can be packaged and sent to S3." ] }, { "cell_type": "code", "execution_count": null, "id": "027c4f53", "metadata": {}, "outputs": [], "source": [ "# Save the TorchScript for later use\n", "model_neuron.save('neuron_compiled_model.pt')" ] }, { "cell_type": "markdown", "id": "d362c579", "metadata": {}, "source": [ "### Package the pre-trained model and upload it to S3\n", "\n", "To make the model available for the SageMaker deployment, you will TAR the serialized graph and upload it to the default Amazon S3 bucket for your SageMaker session. " ] }, { "cell_type": "code", "execution_count": null, "id": "29c7f7b4", "metadata": {}, "outputs": [], "source": [ "# Now you'll create a model.tar.gz file to be used by SageMaker endpoint\n", "!tar -czvf model.tar.gz neuron_compiled_model.pt" ] }, { "cell_type": "code", "execution_count": null, "id": "1beadca0", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "from sagemaker.utils import name_from_base\n", "import sagemaker" ] }, { "cell_type": "code", "execution_count": null, "id": "06ad87d4", "metadata": {}, "outputs": [], "source": [ "# upload model to S3\n", "role = sagemaker.get_execution_role()\n", "sess=sagemaker.Session()\n", "region=sess.boto_region_name\n", "bucket=sess.default_bucket()\n", "sm_client=boto3.client('sagemaker')" ] }, { "cell_type": "code", "execution_count": null, "id": "5205ec55", "metadata": {}, "outputs": [], "source": [ "model_key = '{}/model/model.tar.gz'.format('inf1_compiled_model')\n", "model_path = 's3://{}/{}'.format(bucket, model_key)\n", "boto3.resource('s3').Bucket(bucket).upload_file('model.tar.gz', model_key)\n", "print(\"Uploaded model to S3:\")\n", "print(model_path)" ] }, { "cell_type": "markdown", "id": "e8b425d4", "metadata": {}, "source": [ "## Build and Push the container" ] }, { "cell_type": "markdown", "id": "430e6ed2", "metadata": {}, "source": [ "The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push.\n", "The Dockerfile in this example is available in the ***container*** folder.\n", "Here's an example of the Dockerfile:\n", "\n", "```Dockerfile\n", "FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04\n", "\n", "# Install packages \n", "RUN pip install \"transformers==4.7.0\"\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "3970025d", "metadata": {}, "outputs": [], "source": [ "!cat container/Dockerfile" ] }, { "cell_type": "markdown", "id": "62f78b0f", "metadata": {}, "source": [ "Before running the next cell, make sure your SageMaker IAM role has access to ECR. If not, you can attache the role `AmazonEC2ContainerRegistryPowerUser` to your IAM role ARN, which allows you to upload image layers to ECR. 
\n", "\n", "It takes 5 minutes to build docker images and upload image to ECR" ] }, { "cell_type": "code", "execution_count": null, "id": "ecd51acf", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "algorithm_name=neuron-py36-inference\n", "\n", "cd container\n", "\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", "region=$(aws configure get region)\n", "region=${region:-us-west-2}\n", "\n", "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "\n", "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", "\n", "if [ $? -ne 0 ]\n", "then\n", " aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n", "fi\n", "\n", "# Get the login command from ECR in order to pull down the SageMaker PyTorch image\n", "aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com\n", "# Build the docker image locally with the image name and then push it to ECR\n", "# with the full name.\n", "docker build -t ${algorithm_name} . --build-arg REGION=${region}\n", "docker tag ${algorithm_name} ${fullname}\n", "\n", "# Get the login command from ECR and execute it directly\n", "aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com\n", "docker push ${fullname}" ] }, { "cell_type": "markdown", "id": "e4f6bbda", "metadata": {}, "source": [ "## Deploy Container and run inference based on the pretrained model" ] }, { "cell_type": "markdown", "id": "64e65e31", "metadata": {}, "source": [ "To deploy a pretrained PyTorch model, you'll need to use the PyTorch estimator object to create a PyTorchModel object and set a different entry_point.\n", "\n", "You'll use the PyTorchModel object to deploy a PyTorchPredictor. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference." ] }, { "cell_type": "code", "execution_count": null, "id": "f343d3b1", "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "!{sys.executable} -m pip install Transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd73b77", "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "\n", "role = sagemaker.get_execution_role()\n", "sess = sagemaker.Session()\n", "\n", "bucket = sess.default_bucket()\n", "prefix = \"inf1_compiled_model/model\"\n", "\n", "# Get container name in ECR\n", "client=boto3.client('sts')\n", "account=client.get_caller_identity()['Account']\n", "\n", "my_session=boto3.session.Session()\n", "region=my_session.region_name\n", "\n", "algorithm_name=\"neuron-py36-inference\"\n", "ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)\n", "print(ecr_image)" ] }, { "cell_type": "markdown", "id": "9298f2a7", "metadata": {}, "source": [ "An implementation of *model_fn* is required for inference script.\n", "We are going to implement our own **model_fn** and **predict_fn** for Hugging Face Bert, and use default implementations of **input_fn** and **output_fn** defined in sagemaker-pytorch-containers.\n", "\n", "In this example, the inference script is put in ***code*** folder. 
Run the next cell to see it:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cfea75b6", "metadata": {}, "outputs": [], "source": [ "!pygmentize code/inference.py" ] }, { "cell_type": "markdown", "id": "1b31a7b8", "metadata": {}, "source": [ "Path of compiled pretrained model in S3:" ] }, { "cell_type": "code", "execution_count": null, "id": "61f3556e", "metadata": {}, "outputs": [], "source": [ "key = os.path.join(prefix, \"model.tar.gz\")\n", "pretrained_model_data = \"s3://{}/{}\".format(bucket, key)\n", "print(pretrained_model_data)" ] }, { "cell_type": "markdown", "id": "e7557a5f", "metadata": {}, "source": [ "The model object is defined using the SageMaker Python SDK's PyTorchModel, passing in the S3 path of the compiled model and the entry_point. The endpoint's entry point for inference is defined by model_fn, as seen in the previous code block that prints out **inference.py**. The model_fn function will load the model and the required tokenizer.\n", "\n", "Note that **image_uri** must point to the user's own ECR image." ] }, { "cell_type": "code", "execution_count": null, "id": "0bd99768", "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch.model import PyTorchModel\n", "\n", "pytorch_model = PyTorchModel(\n", " model_data=pretrained_model_data,\n", " role=role,\n", " source_dir=\"code\",\n", " framework_version=\"1.7.1\",\n", " entry_point=\"inference.py\",\n", " image_uri=ecr_image\n", ")\n", "\n", "# Let SageMaker know that we've already compiled the model via neuron-cc\n", "pytorch_model._is_compiled_model = True" ] }, { "cell_type": "markdown", "id": "67439fe7", "metadata": {}, "source": [ "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint.\n", "\n", "Here you will deploy the model to a single **ml.inf1.2xlarge** instance.\n", "It may take 6-10 min to deploy." ] }, { "cell_type": "code", "execution_count": null, "id": "d771fc7c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "predictor = pytorch_model.deploy(initial_instance_count=1, instance_type=\"ml.inf1.2xlarge\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ab6342f3", "metadata": {}, "outputs": [], "source": [ "print(predictor.endpoint_name)" ] }, { "cell_type": "markdown", "id": "059537d9", "metadata": {}, "source": [ "Since in input_fn we declared that the incoming requests are JSON-encoded, we need to use a JSON serializer to encode the outgoing data into a JSON string. Likewise, since we declared the return content type to be a JSON string, we need to use a JSON deserializer to parse the response." ] }, { "cell_type": "code", "execution_count": null, "id": "29e82f90", "metadata": {}, "outputs": [], "source": [ "predictor.serializer = sagemaker.serializers.JSONSerializer()\n", "predictor.deserializer = sagemaker.deserializers.JSONDeserializer()" ] }, { "cell_type": "markdown", "id": "d006ea03", "metadata": {}, "source": [ "Now the SageMaker endpoint is invoked with a list of sentences to get predictions."
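, "\n", "Equivalently (illustrative sketch, not required for the tutorial), the endpoint can be invoked directly through the ``boto3`` runtime client:\n", "\n", "```python\n", "import boto3, json\n", "smr = boto3.client('sagemaker-runtime')\n", "response = smr.invoke_endpoint(\n", "    EndpointName=predictor.endpoint_name,\n", "    ContentType='application/json',\n", "    Body=json.dumps(['sentence one', 'sentence two']))\n", "print(json.loads(response['Body'].read()))\n", "```"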
] }, { "cell_type": "code", "execution_count": null, "id": "325a87f8", "metadata": {}, "outputs": [], "source": [ "%%time\n", "result = predictor.predict(\n", " [\n", " \"Never allow the same bug to bite you twice.\",\n", " \"The best part of Amazon SageMaker is that it makes machine learning easy.\",\n", " ]\n", ")\n", "print(result)" ] }, { "cell_type": "code", "execution_count": null, "id": "4a12410d", "metadata": {}, "outputs": [], "source": [ "%%time\n", "result = predictor.predict(\n", " [\n", " \"The company HuggingFace is based in New York City\",\n", " \"HuggingFace's headquarters are situated in Manhattan\",\n", " ]\n", ")\n", "print(result)" ] }, { "cell_type": "markdown", "id": "a72dfd16", "metadata": {}, "source": [ "## Benchmarking your endpoint\n", "\n", "The following cells create a load test for your endpoint. You first define some helper functions: `inference_latency` runs the endpoint request, collects cliend side latency and any errors, `random_sentence` builds random to be sent to the endpoint. " ] }, { "cell_type": "code", "execution_count": null, "id": "088d0e75", "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "import datetime\n", "import math\n", "import time\n", "import boto3 \n", "import matplotlib.pyplot as plt\n", "from joblib import Parallel, delayed\n", "import numpy as np\n", "from tqdm import tqdm\n", "import random" ] }, { "cell_type": "code", "execution_count": null, "id": "038d9953", "metadata": {}, "outputs": [], "source": [ "def inference_latency(model,*inputs):\n", " \"\"\"\n", " infetence_time is a simple method to return the latency of a model inference.\n", "\n", " Parameters:\n", " model: torch model onbject loaded using torch.jit.load\n", " inputs: model() args\n", "\n", " Returns:\n", " latency in seconds\n", " \"\"\"\n", " error = False\n", " start = time.time()\n", " try:\n", " results = model(*inputs)\n", " except:\n", " error = True\n", " results = []\n", " return {'latency':time.time() - start, 'error': error, 'result': results}" ] }, { "cell_type": "code", "execution_count": null, "id": "d6b200ac", "metadata": {}, "outputs": [], "source": [ "def random_sentence():\n", " \n", " s_nouns = [\"A dude\", \"My mom\", \"The king\", \"Some guy\", \"A cat with rabies\", \"A sloth\", \"Your homie\", \"This cool guy my gardener met yesterday\", \"Superman\"]\n", " p_nouns = [\"These dudes\", \"Both of my moms\", \"All the kings of the world\", \"Some guys\", \"All of a cattery's cats\", \"The multitude of sloths living under your bed\", \"Your homies\", \"Like, these, like, all these people\", \"Supermen\"]\n", " s_verbs = [\"eats\", \"kicks\", \"gives\", \"treats\", \"meets with\", \"creates\", \"hacks\", \"configures\", \"spies on\", \"retards\", \"meows on\", \"flees from\", \"tries to automate\", \"explodes\"]\n", " p_verbs = [\"eat\", \"kick\", \"give\", \"treat\", \"meet with\", \"create\", \"hack\", \"configure\", \"spy on\", \"retard\", \"meow on\", \"flee from\", \"try to automate\", \"explode\"]\n", " infinitives = [\"to make a pie.\", \"for no apparent reason.\", \"because the sky is green.\", \"for a disease.\", \"to be able to make toast explode.\", \"to know more about archeology.\"]\n", " \n", " return (random.choice(s_nouns) + ' ' + random.choice(s_verbs) + ' ' + random.choice(s_nouns).lower() or random.choice(p_nouns).lower() + ' ' + random.choice(infinitives))\n", "\n", "print([random_sentence(), random_sentence()])" ] }, { "cell_type": "markdown", "id": "e2945dde", "metadata": {}, "source": [ "The following cell 
creates `number_of_clients` concurrent threads to run `number_of_runs` requests. Once completed, a `boto3` CloudWatch client will query for the server-side latency metrics for comparison. " ] }, { "cell_type": "code", "execution_count": null, "id": "69c047e3", "metadata": {}, "outputs": [], "source": [ "# Defining Auxiliary variables\n", "number_of_clients = 2\n", "number_of_runs = 1000\n", "t = tqdm(range(number_of_runs),position=0, leave=True)\n", "\n", "# Starting parallel clients\n", "cw_start = datetime.datetime.utcnow()\n", "\n", "results = Parallel(n_jobs=number_of_clients,prefer=\"threads\")(delayed(inference_latency)(predictor.predict,[random_sentence(), random_sentence()]) for mod in t)\n", "avg_throughput = t.total/t.format_dict['elapsed']\n", "\n", "cw_end = datetime.datetime.utcnow() \n", "\n", "# Computing metrics and print\n", "latencies = [res['latency'] for res in results]\n", "errors = [res['error'] for res in results]\n", "error_p = sum(errors)/len(errors) *100\n", "p50 = np.quantile(latencies[-1000:],0.50) * 1000\n", "p90 = np.quantile(latencies[-1000:],0.90) * 1000\n", "p95 = np.quantile(latencies[-1000:],0.95) * 1000\n", "\n", "print(f'Avg Throughput: {avg_throughput:.1f}\\n')\n", "print(f'50th Percentile Latency:{p50:.1f} ms')\n", "print(f'90th Percentile Latency:{p90:.1f} ms')\n", "print(f'95th Percentile Latency:{p95:.1f} ms\\n')\n", "print(f'Errors percentage: {error_p:.1f} %\\n')\n", "\n", "# Querying CloudWatch\n", "print('Getting Cloudwatch:')\n", "cloudwatch = boto3.client('cloudwatch')\n", "statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']\n", "extended=['p50', 'p90', 'p95', 'p100']\n", "\n", "# Give 5 minute buffer to end\n", "cw_end += datetime.timedelta(minutes=5)\n", "\n", "# Period must be 1, 5, 10, 30, or multiple of 60\n", "# Calculate closest multiple of 60 to the total elapsed time\n", "factor = math.ceil((cw_end - cw_start).total_seconds() / 60)\n", "period = factor * 60\n", "print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))\n", "print('Using period of {} seconds\\n'.format(period))\n", "\n", "cloudwatch_ready = False\n", "# Keep polling CloudWatch metrics until datapoints are available\n", "while not cloudwatch_ready:\n", " time.sleep(30)\n", " print('Waiting 30 seconds ...')\n", " # Must use default units of microseconds\n", " model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',\n", " Dimensions=[{'Name': 'EndpointName',\n", " 'Value': predictor.endpoint_name},\n", " {'Name': 'VariantName',\n", " 'Value': \"AllTraffic\"}],\n", " Namespace=\"AWS/SageMaker\",\n", " StartTime=cw_start,\n", " EndTime=cw_end,\n", " Period=period,\n", " Statistics=statistics,\n", " ExtendedStatistics=extended\n", " )\n", " # SampleCount should equal number_of_runs once the datapoints are available\n", " if len(model_latency_metrics['Datapoints']) > 0:\n", " print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))\n", " # ModelLatency is reported in microseconds; divide by 1000 to get milliseconds\n", " side_avg = model_latency_metrics['Datapoints'][0]['Average'] / 1000.0\n", " side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / 1000.0\n", " side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / 1000.0\n", " side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / 1000.0\n", " side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / 1000.0\n", " \n", " print(f'50th Percentile Latency:{side_p50:.1f} ms')\n", " print(f'90th Percentile Latency:{side_p90:.1f} 
ms')\n", " print(f'95th Percentile Latency:{side_p95:.1f} ms\\n')\n", "\n", " cloudwatch_ready = True\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "9035e681", "metadata": {}, "source": [ "### Cleanup\n", "Endpoints should be deleted when no longer in use, to avoid costs." ] }, { "cell_type": "code", "execution_count": null, "id": "1284ef3f", "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint(predictor.endpoint)" ] }, { "cell_type": "code", "execution_count": null, "id": "5af53873", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/pytorch/libtorch_demo/bert_neuronx/compile.py ================================================ import torch from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig import transformers import os import warnings from detect_instance import get_instance_type, get_num_neuroncores instance_type = get_instance_type() print(f"Detected instance type: {instance_type}") if 'inf1' in instance_type: print(" - using torch_neuron.trace") from torch_neuron import trace else: print(" - using torch_neuronx.xla_impl.trace") from torch_neuronx.xla_impl.trace import trace print() os.environ['TOKENIZERS_PARALLELISM']='false' batch_size = 6 # Setting up NeuronCore groups for inf1.6xlarge with 16 cores num_cores = get_num_neuroncores(instance_type) print(f"Number of cores = {num_cores}") os.environ['NEURON_RT_NUM_CORES'] = str(num_cores) # Build tokenizer and model tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc") model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False) # Setup some example inputs sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "Apples are especially bad for your health" sequence_2 = "HuggingFace's headquarters are situated in Manhattan" max_length=128 paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") # Convert example inputs to a format that is compatible with TorchScript tracing example_inputs_paraphrase = ( torch.cat([paraphrase['input_ids']] * batch_size,0), torch.cat([paraphrase['attention_mask']] * batch_size,0), torch.cat([paraphrase['token_type_ids']] * batch_size,0) ) # Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron try: model_neuron = trace(model, example_inputs_paraphrase) except Exception as e: print(e) print("libtorch_demo: Model tracing failed - check tutorial steps and preconditions") print("libtorch_demo: If this does not resolve your issue - Report a bug at ") print("https://github.com/aws-neuron/aws-neuron-sdk/issues") exit(1) # Verify the TorchScript works on both example inputs try: paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase) 
except: print("libtorch_demo: Neuron runtime failed - check tutorial steps and preconditions") print("libtorch_demo: If this does not resolve your issue - Report a bug at ") print("https://github.com/aws-neuron/aws-neuron-sdk/issues") exit(1) # Save the TorchScript for later use model_neuron.save(f'bert_neuron_b{batch_size}.pt') ================================================ FILE: src/examples/pytorch/libtorch_demo/bert_neuronx/detect_instance.py ================================================ import torch import torch_neuronx from typing import Optional INSTANCETYPE_TO_NEURONCORES = { "inf1.xlarge": 4, "inf1.2xlarge": 4, "inf1.6xlarge": 16, "inf2.xlarge": 2, "inf2.8xlarge": 2, "inf2.24xlarge": 12, "inf2.48xlarge": 24, "inf1.24xlarge": 64, "trn1.2xlarge": 2, "trn1.32xlarge": 32, } def get_instance_type() -> str: """Try to obtain the instance type.""" try: from urllib.request import Request, urlopen req = Request("http://169.254.169.254/latest/api/token", method="PUT") req.add_header("X-aws-ec2-metadata-token-ttl-seconds", "21600") with urlopen(req) as response: token = response.read().decode("utf-8") req = Request("http://169.254.169.254/latest/meta-data/instance-type") req.add_header("X-aws-ec2-metadata-token", token) with urlopen(req) as response: instance_type = response.read().decode("utf-8") return instance_type except: # noqa: E722, there are various ways above code can fail and we don't care return None def get_num_neuroncores(instance_type: Optional[str] = None) -> int: """ Try to obtain the maximum number of NeuronCores available on this instance. Args: instance_type: The Neuron instance type. Autodetermined from current instance if not provided. Returns: The number of NeuronCores (or 2 if the type is unknown). """ try: if not instance_type: instance_type = get_instance_type() return INSTANCETYPE_TO_NEURONCORES[instance_type] except KeyError: num_cores = get_num_neuroncores_v3() return num_cores def get_num_neuroncores_v3() -> int: """ Retrieve the number of NeuronCores visible to this process. Returns: The number of visible neuron cores. Raises: RuntimeError: If the Neuron runtime cannot be initialized. This most commonly occurs when executing on an instance with no Neuron devices available or when no Neuron devices are visible to the process. 
""" runtime = torch.classes.neuron.Runtime() try: nc_count = runtime.get_visible_nc_count() except RuntimeError as e: raise RuntimeError( "Neuron runtime cannot be initialized; cannot determine the number of available NeuronCores" # noqa: E501 ) from e return nc_count ================================================ FILE: src/examples/pytorch/libtorch_demo/clean.sh ================================================ #!/bin/bash echo "Clean up constructed files" rm -rf bert_neuron_b6.pt example-app tokenizers venv/ libtorch/ tokenizers_binding/lib/ tokenizers_binding/venv all_metrics.csv venv ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/README.txt ================================================ AWS NEURON TORCHLIB DEMO FOR C++ ================================ For the full tutorial, please refer to: https://awsdocs-neuron.readthedocs-hosted.com ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/build.sh ================================================ #!/bin/bash # Installation script to build with torch dependency from /usr/local set -x # Find paths for local packages PATH_TOKENIZERS_LIB=../tokenizers_binding/lib PATH_TORCH=../libtorch PATH_TORCH_INC=${PATH_TORCH}/include PATH_TORCH_LIB=${PATH_TORCH}/lib PATH_NEURON_LIB=${PATH_TORCH}/lib if [ ! -e "${PATH_TORCH_LIB}/libnrt.so.1" ] && [ -e "/opt/aws/neuron/lib/libnrt.so.1" ] then PATH_NEURON_LIB=/opt/aws/neuron/lib/ fi g++ utils.cpp example_app.cpp \ -o ../example-app \ -O2 \ -D_GLIBCXX_USE_CXX11_ABI=1 \ -I${PATH_TORCH_INC} \ -L${PATH_TOKENIZERS_LIB} \ -L${PATH_NEURON_LIB} \ -L${PATH_TORCH_LIB} \ -Wl,-rpath,libtorch/lib \ -Wl,-rpath,tokenizers_binding/lib \ -Wl,-rpath,$PATH_NEURON_LIB \ -Wl,-no-as-needed \ -ltokenizers \ -ltorchneuron \ -ltorch_cpu \ -lc10 \ -lpthread \ -lnrt \ -std=c++17 ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/core_count.hpp ================================================ #pragma once /* * Copyright 2021, Amazon.com, Inc. or its affiliates. 
All Rights Reserved */ #ifdef __cplusplus extern "C" { #endif typedef enum { NRT_SUCCESS = 0, NRT_FAILURE = 1, NRT_INVALID = 2, NRT_INVALID_HANDLE = 3, NRT_RESOURCE = 4, NRT_TIMEOUT = 5, NRT_HW_ERROR = 6, NRT_QUEUE_FULL = 7, NRT_LOAD_NOT_ENOUGH_NC = 9, NRT_UNSUPPORTED_NEFF_VERSION = 10, NRT_FAIL_HOST_MEM_ALLOC = 11, NRT_EXEC_BAD_INPUT = 1002, NRT_EXEC_COMPLETED_WITH_NUM_ERR = 1003, NRT_EXEC_COMPLETED_WITH_ERR = 1004, NRT_EXEC_NC_BUSY = 1005, NRT_COLL_PENDING = 1100, } NRT_STATUS; NRT_STATUS nrt_get_total_nc_count(uint32_t *nc_count); #ifdef __cplusplus } #endif ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/example_app.cpp ================================================ #include <torch/script.h> #include <algorithm> #include <atomic> #include <cassert> #include <chrono> #include <cmath> #include <condition_variable> #include <cstdlib> #include <functional> #include <iostream> #include <mutex> #include <thread> #include <vector> #include "utils.hpp" #include "core_count.hpp" #include "../tokenizers_binding/remote_rust_tokenizer.h" typedef std::vector<std::vector<long>> Input; namespace { // some hardcoded parameters that could be read from a config file const size_t seq_len = 128; const size_t batch_size = 6; uint32_t num_neuron_cores = 0; const size_t cores_per_model = 1; const size_t num_runs_per_neuron_core = 2000; // these token ids are particular to a vocabulary, could be parsed from vocab file const long start_token = 101; const long end_token = 102; } // construct a single input: input_ids, attention_mask, and token_type_ids from two input sentences Input get_input(const std::string& sentence_1, const std::string& sentence_2) { // ensure the concatenated sentences + separator tokens do not exceed the compiled sequence length assert(sentence_1.size() + sentence_2.size() + 3 <= seq_len); // tokenize the input sentence using the HuggingFace Tokenizers library std::vector<long> input_ids(seq_len, 0); input_ids[0] = start_token; size_t pos = 1; // current write position in input_ids // tokenize sentence_1 and copy to output buffer std::vector<uint32_t> buffer(seq_len, 0); remote_rust_encode(sentence_1.c_str(), buffer.data(), buffer.size()); for (size_t i = 0; i < seq_len && buffer[i]; i++, pos++) { input_ids[pos] = buffer[i]; } // mark end of sentence_1 input_ids[pos++] = end_token; const size_t sentence_2_start = pos; // tokenize sentence_2 and copy to output buffer std::fill(buffer.begin(), buffer.end(), 0); remote_rust_encode(sentence_2.c_str(), buffer.data(), buffer.size()); for (size_t i = 0; i < seq_len && buffer[i]; i++, pos++) { input_ids[pos] = buffer[i]; } // mark end of sentence_2 input_ids[pos++] = end_token; // construct attention mask std::vector<long> attention_mask(seq_len, 0); for (size_t i = 0; i < seq_len; ++i) attention_mask[i] = input_ids[i] ? 1 : 0; // token type ids are 0s for sentence_1 (incl. separators), 1s for sentence_2 std::vector<long> token_type_ids(seq_len, 0); for (size_t i = sentence_2_start; i < seq_len; i++) { if (!attention_mask[i]) break; token_type_ids[i] = 1; } return {input_ids, attention_mask, token_type_ids}; } // reshape a vector of inputs into a proper batch std::vector<torch::jit::IValue> get_batch(const std::vector<Input>& inputs) { // must be given a full batch assert(inputs.size() == batch_size); torch::Tensor input_ids_tensor = torch::zeros({batch_size, seq_len}, at::kLong); torch::Tensor attention_mask_tensor = torch::zeros({batch_size, seq_len}, at::kLong); torch::Tensor token_type_ids_tensor = torch::zeros({batch_size, seq_len}, at::kLong); const auto opts = torch::TensorOptions().dtype(torch::kLong); for (size_t i = 0; i < batch_size; i++) { input_ids_tensor.slice(0, i, i+1) = torch::from_blob((void*)inputs[i][0].data(), {seq_len}, opts); attention_mask_tensor.slice(0, i, i+1) = torch::from_blob((void*)inputs[i][1].data(), {seq_len}, opts); token_type_ids_tensor.slice(0, i, i+1) = torch::from_blob((void*)inputs[i][2].data(), {seq_len}, opts); } return {input_ids_tensor, attention_mask_tensor, token_type_ids_tensor}; } int sanity_check(const std::string& model_filename) { // load the model auto model = get_model(model_filename); // construct some example inputs const std::string sentence_1 = "The company HuggingFace is based in New York City"; const std::string sentence_2 = "Apples are especially bad for your health"; const std::string sentence_3 = "HuggingFace's headquarters are situated in Manhattan"; const auto paraphrase = get_input(sentence_1, sentence_3); const auto not_paraphrase = get_input(sentence_1, sentence_2); // batch the inputs 50/50 positive/negative std::vector<Input> inputs(batch_size); for (size_t i = 0; i < batch_size; ++i) { if (i < batch_size / 2) { inputs[i] = paraphrase; } else { inputs[i] = not_paraphrase; } } const auto batch = get_batch(inputs); // forward pass const auto output = model.forward(batch); // interpret output const auto output_tensor = output.toTuple()->elements()[0].toTensor(); const auto paraphrase_probabilities = torch::softmax(output_tensor[0], 0); const auto not_paraphrase_probabilities = torch::softmax(output_tensor[batch_size-1], 0); const auto paraphrase_0 = std::round(paraphrase_probabilities[0].item<float>() * 100); const auto paraphrase_1 = std::round(paraphrase_probabilities[1].item<float>() * 100); const auto not_paraphrase_0 = std::round(not_paraphrase_probabilities[0].item<float>() * 100); const auto not_paraphrase_1 = std::round(not_paraphrase_probabilities[1].item<float>() * 100); std::cout << sentence_1 << std::endl << sentence_3 << std::endl; std::cout << "not paraphrase: " << paraphrase_0 << "%" << std::endl; std::cout << "paraphrase: " << paraphrase_1 << "%" << std::endl; if (paraphrase_0 >= paraphrase_1) return -1; std::cout << std::endl; std::cout << sentence_1 << std::endl << sentence_2 << std::endl; std::cout << "not paraphrase: " << not_paraphrase_0 << "%" << std::endl; std::cout << "paraphrase: " << not_paraphrase_1 << "%" << std::endl; if (not_paraphrase_0 <= not_paraphrase_1) return -2; return 0; } void benchmark(const std::string& model_filename, const std::vector<torch::jit::IValue>& batch, std::condition_variable& warmup_cv, std::atomic_size_t& warmup_count, std::condition_variable& ready_cv) { // load model and warmup auto model = get_model(model_filename); model.forward(batch); std::cout << "." << std::flush; --warmup_count; warmup_cv.notify_one(); // wait for ready signal std::mutex ready_mutex; std::unique_lock<std::mutex> lk(ready_mutex); ready_cv.wait(lk); // benchmark for (size_t i = 0; i < num_runs_per_neuron_core; i++) { if (i == num_runs_per_neuron_core/2) std::cout << "." << std::flush; model.forward(batch); } } int main(int argc, char *argv[]) { if (argc < 2) { std::cerr << "Usage: ./example_app neuron_traced_model.pt [--sanity]" << std::endl; return -1; } if( nrt_get_total_nc_count( &num_neuron_cores ) != NRT_SUCCESS ) { std::cerr << "Could not determine number of cores - aborting!" << std::endl; return -1; } // let the runtime know which NeuronCores should be visible (e.g. "0-15") setenv("NEURON_RT_VISIBLE_CORES", get_visible_cores_str(num_neuron_cores, cores_per_model).c_str(), true); if (argc >= 3 && std::string("--sanity") == argv[2]) { return sanity_check(argv[1]); } /*************************************************************************/ // prepare inputs, prepare models, and perform warmup inference std::cout << "Getting ready" << std::flush; const auto input = get_input("This sentence is for benchmarking.", "For benchmarking, use this sentence."); const auto batch = get_batch(std::vector<Input>(batch_size, input)); std::condition_variable warmup_cv, ready_cv; std::atomic_size_t warmup_count(num_neuron_cores); std::vector<std::thread> threads(num_neuron_cores); for (size_t i = 0; i < threads.size(); i++) { threads[i] = std::move(std::thread(benchmark, argv[1], batch, std::ref(warmup_cv), std::ref(warmup_count), std::ref(ready_cv))); } // wait for warmup to complete auto is_warmup_complete = [](std::atomic_size_t& warmup_count) { return warmup_count.load() == 0; }; std::mutex warmup_mutex; std::unique_lock<std::mutex> lk(warmup_mutex); warmup_cv.wait(lk, std::bind(is_warmup_complete, std::ref(warmup_count))); std::cout << std::endl; /*************************************************************************/ // begin timed benchmarking std::cout << "Benchmarking" << std::flush; // signal workers to begin benchmarking and wait for completion const auto start_time = std::chrono::high_resolution_clock::now(); ready_cv.notify_all(); for (auto& thread : threads) thread.join(); const auto end_time = std::chrono::high_resolution_clock::now(); std::cout << std::endl; // report statistics const float elapsed = (end_time - start_time) / std::chrono::seconds(1); const size_t num_inferences = num_neuron_cores * num_runs_per_neuron_core; const float throughput = (float)(num_inferences * batch_size) / elapsed; std::cout << "Completed " << num_inferences << " operations in " << elapsed << " seconds => " << throughput << " pairs / second" << std::endl; std::cout << std::endl; std::cout << "====================" << std::endl; std::cout << "Summary information:" << std::endl; std::cout << "====================" << std::endl; std::cout << "Batch size = " << batch_size << std::endl; std::cout << "Num neuron cores = " << num_neuron_cores << std::endl; std::cout << "Num runs per neuron core = " << num_runs_per_neuron_core << std::endl; return 0; } ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/utils.cpp ================================================ #include "utils.hpp" #include "../tokenizers_binding/remote_rust_tokenizer.h" #include <cstddef> #include <random> #include <sstream> #include <string> std::string get_visible_cores_str(size_t num_neuron_cores, size_t cores_per_model) { std::ostringstream oss; oss << "0-" << ((num_neuron_cores * cores_per_model) - 1); return oss.str(); } std::string get_uuid() { 
// xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx // M = version = 4, (4 bits, 0100 = 0x4) // N = variant = 1, (2 bits, 10XX = 0x{8, 9, A, B}) static const char *chars = "0123456789abcdef"; static std::random_device rd; static std::mt19937 mt(rd()); static std::uniform_int_distribution<> dist(0, 15); std::stringstream ss; for (size_t i = 0; i < 37; i++) { const int index = dist(mt); ss << chars[index]; } // variant bits are 10XX std::stringstream variant_ss; size_t variant; variant_ss << std::hex << chars[dist(mt)]; variant_ss >> variant; variant = 0x8 | (0x3 & variant); ss.seekp(9); ss << "-"; ss.seekp(14); ss << "-4"; ss.seekp(19); ss << "-" << std::hex << variant; ss.seekp(24); ss << "-"; return ss.str(); } torch::jit::script::Module get_model(const std::string& filename) { torch::jit::script::Module model = torch::jit::load(filename); // If you're using a model traced with torch-neuron >= 1.8, // the section below is no longer necessary. It was a workaround // for a runtime issue when loading identical copies of a model. // This is redundant in the new flow, but left to provide a future // pointer on torchscript graph manipulation if needed // this next section adds a unique uuid to the graph, so that the neuron runtime // will load the graph multiple times instead of reusing a previously loaded copy /* auto fwd = model.get_method("forward"); auto& fn = static_cast<torch::jit::GraphFunction&>(fwd.function()); auto graph = fn.graph(); torch::jit::Inline(*graph); for (auto node : graph->nodes()) { if (std::string(node->kind().toQualString()).rfind("neuron::forward") == 0) { auto uuid_input_tensor = node->inputs()[1]; if (std::string(uuid_input_tensor->node()->kind().toQualString()).rfind("prim::Constant") == 0) { // we clone the tensor to retain ownership of "the blob" after it goes out of scope const std::string uuid = get_uuid(); torch::Tensor t = torch::from_blob((void*)uuid.c_str(), {36}, torch::kUInt8).clone(); // if we don't move the insertion point so that the copy of the constant appears after the operator, // the inference will crash graph->setInsertPoint(node); torch::jit::Value *val = graph->insertConstant(t); node->replaceInputWith(uuid_input_tensor, val); // ensure a valid graph graph->lint(); } } } */ return model; } ================================================ FILE: src/examples/pytorch/libtorch_demo/example_app/utils.hpp ================================================ #ifndef __UTILS_HPP__ #define __UTILS_HPP__ #include <torch/script.h> std::string get_visible_cores_str(size_t num_neuron_cores, size_t cores_per_model); std::string get_uuid(); torch::jit::script::Module get_model(const std::string& filename); #endif // __UTILS_HPP__ ================================================ FILE: src/examples/pytorch/libtorch_demo/neuron.patch ================================================ From 3f126613c47e4261d0e86520cb6e85c5713e2b15 Mon Sep 17 00:00:00 2001 From: Stephen Dunn Date: Tue, 26 Jan 2021 22:55:40 +0000 Subject: [PATCH] Adds AWS Neuron native C++ interface --- diff --git a/tokenizers/Cargo.toml b/tokenizers/Cargo.toml index c0f1aff..9767da7 100644 --- a/tokenizers/Cargo.toml +++ b/tokenizers/Cargo.toml @@ -19,6 +19,7 @@ exclude = [ "rust-toolchain", "target/*", "Cargo.lock", "benches/*.txt", "benche name = "tokenizers" path = "src/lib.rs" bench = false +crate-type = ["rlib", "cdylib"] [[bench]] name = "bpe_benchmark" diff --git a/tokenizers/src/lib.rs b/tokenizers/src/lib.rs index eb89b93..2392f28 100644 --- a/tokenizers/src/lib.rs +++ b/tokenizers/src/lib.rs @@ -145,6 +145,8 @@ pub mod tokenizer; // Re-export from 
 
 // Re-export from tokenizer
 pub use tokenizer::*;
 
+mod neuron;
+
 // Re-export also parallelism utils
 pub use utils::parallelism;

diff --git a/tokenizers/src/neuron.rs b/tokenizers/src/neuron.rs
new file mode 100644
index 0000000..af4a679
--- /dev/null
+++ b/tokenizers/src/neuron.rs
@@ -0,0 +1,25 @@
+use crate::tokenizer::Tokenizer;
+use std::ffi::CStr;
+use std::os::raw::c_char;
+
+// cached tokenizer
+static mut TOKENIZER: Option<Tokenizer> = None;
+
+#[no_mangle]
+pub unsafe extern "C" fn remote_rust_encode(input_arr: *const c_char, output_arr: *mut u32, output_arr_len: u32) {
+    // load the pretrained tokenizer up if we haven't already
+    let tokenizer = TOKENIZER.get_or_insert_with(|| Tokenizer::from_file("./tokenizer.json").unwrap());
+
+    // convert input from C -> Rust
+    let cstr = CStr::from_ptr(input_arr);
+    let input = cstr.to_str().unwrap();
+
+    // tokenize raw text
+    let encoding = tokenizer.encode(input, false).unwrap();
+
+    // hand the output back to C across shared memory
+    let output = std::slice::from_raw_parts_mut(output_arr, output_arr_len as usize);
+    for (i, token) in &mut encoding.get_ids().to_vec().iter().enumerate() {
+        output[i] = *token;
+    }
+}
\ No newline at end of file

================================================
FILE: src/examples/pytorch/libtorch_demo/run_tests.sh
================================================
#!/bin/bash
set -e

if [ "$#" -ne 1 ]; then
    echo "usage: ./run_tests.sh model_filename.pt"
    exit 1
fi

echo -e "\nRunning tokenization sanity checks.\n"
pushd tokenizers_binding >/dev/null 2>&1
chmod +x run_python.sh run.sh
(./run_python.sh && ./run.sh) || { echo "Sanity checks failed."; exit 2; }
popd >/dev/null 2>&1

echo -e "\nTokenization sanity checks passed."
echo -e "Running end-to-end sanity check.\n"
(./example-app $1 --sanity) || { echo "Sanity check failed."; exit 3; }
echo -e "\nSanity check passed.\n"

================================================
FILE: src/examples/pytorch/libtorch_demo/setup.sh
================================================
#!/bin/bash
set -eEx  # fail on error

TORCH_VERSION=$(python -c "import torch; v=torch.__version__.split('+')[0]; print(f'{v}')")

# Parse CLI args
while [ "$1" != "" ]; do
    case $1 in
        --torch-version ) shift
                          TORCH_VERSION=$1
                          ;;
    esac
    shift
done
echo "Using PyTorch version ${TORCH_VERSION}"

# Python setup
PYTHON=python3
PYTHON_VERSION=$($PYTHON --version | cut -f2 -d' ' | cut -f1,2 -d'.')
echo "Python version is '$PYTHON_VERSION'"

OLD_TOOL_CHAIN=$($PYTHON -c \
    "from bert_neuronx.detect_instance import get_instance_type; print('inf1' in get_instance_type())")
if [ "$OLD_TOOL_CHAIN" == "True" ]; then
    TORCH_VERSION="1.13"
    echo "- Detected inf1 - using version ${TORCH_VERSION}"
else
    echo "- Detected inf2 or trn1 - using version ${TORCH_VERSION}"
fi
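# Illustrative summary of the detection above (an informal sketch, not exhaustive):
#   inf1.*          -> OLD_TOOL_CHAIN=True  -> pin TORCH_VERSION=1.13 (torch-neuron toolchain)
#   inf2.* / trn1.* -> OLD_TOOL_CHAIN=False -> keep the detected torch version (torch-neuronx)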
# checkout tokenizers and apply neuron patch
if [ ! -e "tokenizers" ]; then
    git clone https://github.com/huggingface/tokenizers.git
    cp neuron.patch tokenizers/neuron.patch
    pushd tokenizers
    git checkout d8c4388166cad8f0216dfc485efd6207a3275af2
    git apply neuron.patch
    rm neuron.patch
    popd
fi

# build tests
pushd tokenizers_binding
chmod +x build.sh
./build.sh
popd
cp -f tokenizers_binding/tokenizer.json .

# setup torch
if [ ! -e "libtorch" ]; then
    # Use different download paths based on PyTorch version
    MAJOR_VERSION=$(echo "${TORCH_VERSION}" | cut -d. -f1)
    MINOR_VERSION=$(echo "${TORCH_VERSION}" | cut -d. -f2)
    if [ "$MAJOR_VERSION" -gt 2 ] || ([ "$MAJOR_VERSION" -eq 2 ] && [ "$MINOR_VERSION" -ge 8 ]); then
        wget -q https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-${TORCH_VERSION}%2Bcpu.zip
        unzip -q libtorch-shared-with-deps-${TORCH_VERSION}+cpu.zip
        rm -f libtorch-shared-with-deps-${TORCH_VERSION}+cpu.zip
    else
        wget -q https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-${TORCH_VERSION}%2Bcpu.zip
        unzip -q libtorch-cxx11-abi-shared-with-deps-${TORCH_VERSION}+cpu.zip
        rm -f libtorch-cxx11-abi-shared-with-deps-${TORCH_VERSION}+cpu.zip
    fi
fi

# get libneuron_op.so and install into libtorch
$PYTHON -m pip install --upgrade "transformers==4.40.0"
$PYTHON bert_neuronx/compile.py
site_pkgs_dir=$($PYTHON -c "import site; print(site.getsitepackages()[0])")
if [ "$OLD_TOOL_CHAIN" == "True" ]
then
    cp -f $(find $site_pkgs_dir -exec find {} -type f -name 'libtorchneuron.so' \; -quit | grep torch_neuron) libtorch/lib/
    cp -f $(find $site_pkgs_dir -exec find {} -type f -name 'libnrt.so' \; -quit) libtorch/lib/
    cp -f $(find $site_pkgs_dir -exec find {} -type f -name 'libnrt.so.1' \; -quit) libtorch/lib/
else
    cp -f $(find $site_pkgs_dir -exec find {} -type f -name 'libtorchneuron.so' \; -quit | grep torch_neuronx) libtorch/lib/
fi

# compile example app
pushd example_app
chmod +x build.sh
./build.sh
popd

chmod +x run_tests.sh
echo "Successfully completed setup"

================================================
FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/build.sh
================================================
#!/bin/bash

# clean old artifacts
rm -f tokenizer_test >/dev/null 2>&1
rm -rf lib >/dev/null 2>&1

# build shared library
if [ $# -eq 0 ]; then
    pushd ../tokenizers/tokenizers
    echo "Building release test..."
    cargo build --release
    popd
    cp -r ../tokenizers/tokenizers/target/release lib
    g++ -O3 -o tokenizer_test tokenizer_test.cpp -L./lib -ltokenizers
else
    pushd ../tokenizers/tokenizers
    echo "Building debug test..."
    cargo build
    popd
    cp -r ../tokenizers/tokenizers/target/debug lib
    g++ -O0 -o tokenizer_test tokenizer_test.cpp -L./lib -ltokenizers
fi
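# Invocation sketch: the argument count alone selects the build flavor:
#   ./build.sh          # release: cargo build --release, test binary compiled with -O3
#   ./build.sh debug    # debug:   cargo build, test binary compiled with -O0
# (any argument triggers the debug branch; "debug" is just a readable choice)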
if [ ! -e "tokenizer.json" ]; then
    wget https://huggingface.co/bert-base-cased-finetuned-mrpc/raw/main/tokenizer.json
fi

================================================
FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/remote_rust_tokenizer.h
================================================
#ifndef __REMOTE_RUST_TOKENIZER_H__
#define __REMOTE_RUST_TOKENIZER_H__

#include <stdint.h>

extern "C" {
extern void remote_rust_encode(const char *input_arr, uint32_t* output_arr, uint32_t output_arr_len);
}

#endif // __REMOTE_RUST_TOKENIZER_H__

================================================
FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/run.sh
================================================
#!/bin/bash
set -e
LD_LIBRARY_PATH=./lib ./tokenizer_test

================================================
FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/run_python.sh
================================================
#!/bin/bash
set -e
python tokenizer_test.py

================================================
FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/tokenizer_test.cpp
================================================
#include <iostream>
#include <chrono>    // timing
#include <cstdint>   // rust interface types
#include <iomanip>   // std::setprecision
#include <sstream>   // parse args
#include <cstring>
#include <vector>
#include "remote_rust_tokenizer.h"

#define DEFAULT_NUM_TESTS 10000u

int main(int argc, char *argv[]) {
    // prepare some input to tokenize
    const uint32_t seq_len = 128;
    const std::vector<uint32_t> ground_truth = {
        1409, 1917, 2947, 16193, 117, 1142, 3087, 1209,
        1129, 22559, 2200, 1656, 155, 8954, 119
    };
    const char *input_arr = "If everything goes smoothly, this text will be tokenized inside Rust.";
    uint32_t* output_arr = new uint32_t[seq_len];
    std::memset(output_arr, 0, sizeof(uint32_t) * seq_len);

    // call rust tokenizer
    remote_rust_encode(input_arr, output_arr, seq_len);

    // check output
    std::cout << "Sanity check ";
    for (size_t i = 0; i < ground_truth.size(); ++i) {
        if (output_arr[i] != ground_truth[i]) {
            std::cerr << "failed at: " << i << ", " << output_arr[i]
                      << " != " << ground_truth[i] << std::endl;
            return -1;
        }
    }
    std::cout << "passed." << std::endl;

    // run timed test
    uint32_t num_tests = DEFAULT_NUM_TESTS;
    if (argc >= 3 && !strcmp("--num_tests", argv[1])) {
        std::istringstream iss(argv[2]);
        iss >> num_tests;
    }

    // one progress dot per ~10% of tests; clamp to 1 to avoid a zero modulus
    uint32_t ten_percent = num_tests / 10;
    if (ten_percent == 0) ten_percent = 1;

    std::cout << "Begin " << num_tests << " timed tests." << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t test_num = 0; test_num < num_tests; ++test_num) {
        if (test_num % ten_percent == 0) {
            std::cout << "." << std::flush;
        }
        remote_rust_encode(input_arr, output_arr, seq_len);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double>(end - start);
    std::cout << std::endl
              << "End timed tests." << std::endl
              << "C++ took " << std::setprecision(3) << duration.count() << " seconds."
<< std::endl; return 0; } ================================================ FILE: src/examples/pytorch/libtorch_demo/tokenizers_binding/tokenizer_test.py ================================================ from transformers import AutoTokenizer import argparse import time from tqdm import tqdm parser = argparse.ArgumentParser() parser.add_argument('--num_tests', type=int, default=10_000) args = parser.parse_args() tokenizer = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc') start = time.time() for _ in tqdm(range(args.num_tests), desc='Tokenizing'): tokenizer.encode("If everything goes smoothly, this text will be tokenized inside Rust.") end = time.time() print('Python took {:.2f} seconds.'.format(end - start)) ================================================ FILE: src/examples/pytorch/libtorch_demo/trace_bert_neuron.py ================================================ import torch import torch_neuron from transformers import AutoTokenizer, AutoModelForSequenceClassification # Build tokenizer and model tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc") model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False) # Setup some example inputs sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" max_length = 128 batch_size = 6 paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") example_inputs_paraphrase = ( torch.cat([paraphrase['input_ids']] * batch_size, 0), torch.cat([paraphrase['attention_mask']] * batch_size, 0), torch.cat([paraphrase['token_type_ids']] * batch_size, 0) ) # Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron model_neuron_batch = torch_neuron.trace(model, example_inputs_paraphrase) # Save the batched model model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size)) ================================================ FILE: src/examples/pytorch/mnist_mlp/train_monitor.py ================================================ import os import time import torch import torch.nn as nn import torch.nn.functional as F from torchvision.datasets import mnist from torch.optim import SGD from torch.utils.data import DataLoader from torchvision.transforms import ToTensor # XLA imports import torch_xla.core.xla_model as xm # Declare 3-layer MLP for MNIST dataset class MLP(nn.Module): def __init__(self, input_size = 28 * 28, output_size = 10, layers = [120, 84]): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) def forward(self, x): x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return F.log_softmax(x, dim=1) # Load MNIST train dataset train_dataset = mnist.MNIST(root='./MNIST_DATA_train', \ train=True, download=True, transform=ToTensor()) def main(): # Prepare data loader train_loader = DataLoader(train_dataset, batch_size=32) # Fix the random number generator seeds for reproducibility torch.manual_seed(0) # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance) device = 'xla' # Move model to device and declare optimizer and loss function model = MLP().to(device) optimizer = torch.optim.SGD(model.parameters(), lr=0.01) loss_fn = torch.nn.NLLLoss() # Run the training loop print('----------Training ---------------') for run in range(0, 1000): print(f'Run {run}') 
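        # NOTE: the 1000-iteration outer run loop keeps this training job alive for a
        # long time, presumably so that monitoring tools such as neuron-top or
        # neuron-monitor have a live workload to observe; the train_tb.py variant
        # below performs a single pass and reports throughput instead.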
model.train() for idx, (train_x, train_label) in enumerate(train_loader): optimizer.zero_grad() train_x = train_x.view(train_x.size(0), -1) train_x = train_x.to(device) train_label = train_label.to(device) output = model(train_x) loss = loss_fn(output, train_label) loss.backward() optimizer.step() xm.mark_step() # XLA: collect ops and run them in XLA runtime if idx < 2: # skip warmup iterations start = time.time() # Save checkpoint for evaluation os.makedirs("checkpoints", exist_ok=True) checkpoint = {'state_dict': model.state_dict()} # XLA: use xm.save instead of torch.save to ensure states are moved back to cpu # This can prevent "XRT memory handle not found" at end of test.py execution xm.save(checkpoint,'checkpoints/checkpoint.pt') print('----------End Training ---------------') ================================================ FILE: src/examples/pytorch/mnist_mlp/train_tb.py ================================================ import os import time import torch import torch.nn as nn import torch.nn.functional as F from torchvision.datasets import mnist from torch.optim import SGD from torch.utils.data import DataLoader from torchvision.transforms import ToTensor # XLA imports import torch_xla.core.xla_model as xm from torch.utils.tensorboard import SummaryWriter # Declare 3-layer MLP for MNIST dataset class MLP(nn.Module): def __init__(self, input_size = 28 * 28, output_size = 10, layers = [120, 84]): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, layers[0]) self.fc2 = nn.Linear(layers[0], layers[1]) self.fc3 = nn.Linear(layers[1], output_size) def forward(self, x): x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return F.log_softmax(x, dim=1) # Load MNIST train dataset train_dataset = mnist.MNIST(root='./MNIST_DATA_train', \ train=True, download=True, transform=ToTensor()) def main(): # Prepare data loader train_loader = DataLoader(train_dataset, batch_size=32) # Fix the random number generator seeds for reproducibility torch.manual_seed(0) # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance) device = 'xla' # Move model to device and declare optimizer and loss function model = MLP().to(device) optimizer = torch.optim.SGD(model.parameters(), lr=0.01) loss_fn = torch.nn.NLLLoss() # Use SummaryWriter to generate logs for TensorBoard writer = SummaryWriter('./output') # Run the training loop print('----------Training ---------------') model.train() start = time.time() for idx, (train_x, train_label) in enumerate(train_loader): optimizer.zero_grad() train_x = train_x.view(train_x.size(0), -1) train_x = train_x.to(device) train_label = train_label.to(device) output = model(train_x) loss = loss_fn(output, train_label) writer.add_scalar("step loss", loss, idx) # add the step loss to the TensorBoard logs loss.backward() optimizer.step() xm.mark_step() # XLA: collect ops and run them in XLA runtime if idx < 2: # skip warmup iterations start = time.time() # Compute statistics interval = idx - 2 # skip warmup iterations throughput = interval / (time.time() - start) print("Train throughput (iter/sec): {}".format(throughput)) print("Final loss is {:0.4f}".format(loss.detach().to('cpu'))) # Ensure TensorBoard logs are all written writer.flush() # Save checkpoint for evaluation os.makedirs("checkpoints", exist_ok=True) checkpoint = {'state_dict': model.state_dict()} # XLA: use xm.save instead of torch.save to ensure states are moved back to cpu # This can prevent "XRT memory handle not found" at end of test.py execution 
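    # A minimal sketch of the distinction the comment above describes
    # (assuming a torch_xla environment):
    #   torch.save(checkpoint, path)  # may serialize live XLA device tensors
    #   xm.save(checkpoint, path)     # syncs tensors to CPU first, then saves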
xm.save(checkpoint,'checkpoints/checkpoint.pt') print('----------End Training ---------------') if __name__ == '__main__': main() ================================================ FILE: src/examples/pytorch/neuronx_distributed/t5-inference/t5-inference-tutorial.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [Broken] T5 inference with Tensor Parallelism" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an extension to the [t5 inference tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.html). Here we will use NeuronxDistributed to improve the inference performance using tensor parallelism.\n", "\n", "This tutorial has the following main sections:\n", "\n", "1. Install dependencies\n", "1. Plug in `NeuronxDistributed` layers into T5\n", "1. Compile the T5 model\n", "1. Run distributed inference with beam search \n", "\n", "This Jupyter notebook should be run on an Inf2 instance (`inf2.24xlarge`) or a Trn1 instance (`trn1.32xlarge`)\n", "\n", "> The tutorial works for t5 and flan-t5 models. In this notebook we will run distributed inference with flan-t5-xl." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "The code in this tutorial is written for Jupyter Notebooks. To use Jupyter Notebook on the Neuron instance, you\n", "can use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).\n", "\n", "Run the notebook by cloning aws-neuron-sdk\n", "```\n", "git clone https://github.com/aws-neuron/aws-neuron-sdk.git\n", "cd aws-neuron-sdk/src/examples/pytorch/neuronx_distributed/t5-inference/\n", "```\n", "\n", "Once done, execute `t5-inference-tutorial.ipynb`\n", "\n", "It is recommended to go through the [t5 inference tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.html) before you start this tutorial. \n", "In addition to the dependencies in the [t5 inference tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.html), we need to install neuronx-distributed. \n", "\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuronx`\n", "- `neuronx-cc`\n", "- `transformers`\n", "- `optimum-neuron`\n", "- `neuronx-distributed`\n", "\n", "Most of these packages will be installed when configuring your environment using the Trn1/Inf2 [ setup guide ](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20.html#setup-torch-neuronx-ubuntu20). The additional dependencies must be installed here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade transformers==4.33.1 optimum-neuron neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pull the latest version of the compiler\n", "! pip install --upgrade neuronx-cc>=2.11 --no-deps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's update numpy to a newer version \n", "!
pip install --upgrade \"numpy>=1.22.2,<2\" --no-deps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plug in NeuronxDistributed layers into T5\n", "\n", "We extend the Hugging Face T5 model to use the `NeuronxDistributed` parallel layers. To do so, we simply swap the linear layers in the `T5LayerSelfAttention`, `T5LayerCrossAttention`, and `T5LayerFF` definitions with `ColumnParallelLinear` and `RowParallelLinear`. We also need to swap the `Embedding` layer with `ParallelEmbedding`.\n", "\n", "Let us take the example of T5Attention. The [attention block](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L363-L366) has q, k, v, and o linear layers. \n", "The multi-head attention block uses q, k and v to compute the attention scores. The attention scores are then passed through o to compute the attention block output. \n", "So let us swap the q, k and v layers with `ColumnParallelLinear` and o with `RowParallelLinear`. Having a `RowParallelLinear` follow a `ColumnParallelLinear` is a performance optimization: the attention scores computed with q, k and v are already split across Neuron devices, and the row-parallel layer can consume this sharded output directly. \n", "The embedding layer is simply swapped with the `ParallelEmbedding`.\n", "\n", "```\n", "class ParallelAttention(T5Attention):\n", " def __init__(self, config: T5Config, has_relative_attention_bias=False):\n", " super().__init__(config, has_relative_attention_bias)\n", " # Per attention head and per partition values\n", " world_size = parallel_state.get_tensor_model_parallel_size()\n", " self.num_attention_heads_per_partition = divide(self.n_heads, world_size)\n", " self.hidden_size_per_partition = self.num_attention_heads_per_partition * self.key_value_proj_dim\n", "\n", " # Mesh TensorFlow initialization to avoid scaling before softmax\n", " self.q = ColumnParallelLinear(self.d_model,\n", " self.inner_dim,\n", " bias=False,\n", " gather_output=False)\n", " self.k = ColumnParallelLinear(self.d_model,\n", " self.inner_dim,\n", " bias=False,\n", " gather_output=False)\n", " self.v = ColumnParallelLinear(self.d_model,\n", " self.inner_dim,\n", " bias=False,\n", " gather_output=False)\n", " self.o = RowParallelLinear(self.inner_dim,\n", " self.d_model,\n", " bias=False,\n", " input_is_parallel=True)\n", "\n", " if self.has_relative_attention_bias:\n", " self.relative_attention_bias = ParallelEmbedding(self.relative_attention_num_buckets, self.n_heads)\n", " self.n_heads = self.num_attention_heads_per_partition\n", "...\n", "```\n", "\n", "You can find all the modified T5 layers defined in [t5_model_layers.py](https://github.com/aws-neuron/aws-neuron-sdk/tree/master/src/examples/pytorch/neuronx_distributed/t5-inference/t5_model_layers.py). \n", "\n", "\n", "Once we have the modified T5 layers, we can plug the modified T5Attention and T5LayerFF into the pretrained model. Here is how you do that.
\n", "\n", "```\n", "def load_pretrained_with_parallel_attn(model_name):\n", " \n", " model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=\"auto\")\n", "\n", " # Parallel implementation of Attention modules.\n", " from t5_model_layers import ParallelSelfAttention, ParallelFF, ParallelCrossAttention\n", "\n", " for index, block in enumerate(model.decoder.block):\n", " if index == 0:\n", " block.layer[0] = ParallelSelfAttention(model.config,\n", " has_relative_attention_bias=True)\n", " else:\n", " block.layer[0] = ParallelSelfAttention(model.config)\n", " block.layer[1] = ParallelCrossAttention(model.config)\n", " block.layer[2] = ParallelFF(model.config)\n", " # Load the weights into the parallel layers \n", " neuronx_distributed.parallel_layers.load(model_name + \".pt\", model, sharded=False)\n", "\n", " return model\n", "\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile the parallel T5 model\n", "\n", "Let us set some model parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_name = \"google/flan-t5-xl\" \n", "max_length = 128\n", "num_beams = 4\n", "tp_degree = 8 # tensor parallelism degree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download and save the model that we want to trace. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from transformers import T5ForConditionalGeneration\n", "\n", "model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=\"auto\")\n", "torch.save({\"model\":model.state_dict()}, model_name.split(\"/\")[-1] + \".pt\")\n", "model.config.use_cache = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run HuggingFace T5 models on Neuron, we need to make a couple of changes. Let us reuse the code from the [t5 inference tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.html) which makes T5 compatible with Neuron. For your convenience, the code is copied into [wrapper.py](https://github.com/aws-neuron/aws-neuron-sdk/tree/master/src/examples/pytorch/neuronx_distributed/t5-inference/wrapper.py) and [t5_models.py](https://github.com/aws-neuron/aws-neuron-sdk/tree/master/src/examples/pytorch/neuronx_distributed/t5-inference/t5_models.py). This notebook will import these files. \n", "\n", "The only change made to this code is that we use `neuronx_distributed.trace` instead of `torch_neuronx.trace`. \n", "\n", "Let us trace the encoder and decoder. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import t5_models \n", "import neuronx_distributed\n", "import time \n", "\n", "# This can take up to 20 minutes\n", "encoder_compile_start_time = time.time()\n", "traced_encoder = t5_models.parallel_trace_encoder(model_name, max_length, num_beams, tp_degree)\n", "print(\"Encoder compilation time {}\".format(time.time() - encoder_compile_start_time))\n", "\n", "neuronx_distributed.trace.parallel_model_save(traced_encoder, \"TracedParallelEncoder.pt\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This can take up to 15 minutes\n", "decoder_compile_start_time = time.time()\n", "traced_decoder = t5_models.parallel_trace_decoder(model, model_name, num_beams, max_length, tp_degree)\n", "print(\"Decoder compilation time {}\".format(time.time() - decoder_compile_start_time))\n", "\n", "neuronx_distributed.trace.parallel_model_save(traced_decoder, \"TracedParallelDecoder.pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference with the traced parallel T5 model\n", "\n", "With the traced model, let us try using beam search for inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results:\n", "1 Lassen Sie uns gutes Essen essen.\n", "2 Lassen Sie uns gut essen.\n", "3 Lassen Sie uns gutes Essen zu essen.\n", "4 Lassen Sie uns gutes Essen zu sich nehmen.\n" ] } ], "source": [ "import neuronx_distributed\n", "from wrapper import T5Wrapper\n", "from transformers import T5Tokenizer\n", "\n", "\n", "num_return_sequences = 4\n", "\n", "traced_encoder = neuronx_distributed.trace.parallel_model_load(\"TracedParallelEncoder.pt\")\n", "traced_decoder = neuronx_distributed.trace.parallel_model_load(\"TracedParallelDecoder.pt\")\n", "\n", "tokenizer = T5Tokenizer.from_pretrained(model_name)\n", "model = T5Wrapper.from_pretrained(model_name)\n", "\n", "model.encoder = traced_encoder\n", "model.decoder = traced_decoder\n", "setattr(model.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", "\n", "output = model.parallel_infer(tokenizer=tokenizer,\n", " prompt=\"translate English to German: Lets eat good food.\",\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences,\n", " device=\"xla\")\n", "\n", "results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", "print('Results:')\n", "for i, summary in enumerate(results):\n", " print(i + 1, summary)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmarking\n", "\n", "Let us benchmark the per token decoder latency" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let us install NeuronPerf. We will use it to measure the performance.\n", "! 
pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os \n", "import neuronperf as npf\n", "\n", "d_model = model.config.d_model\n", "model_dir = \"TracedParallelDecoder.pt\"\n", "decoder_run_count = 128\n", "\n", "def load_fn(model_path, **kwargs):\n", " return neuronx_distributed.trace.parallel_model_load(model_path)\n", " \n", "# NeuronPerf can't see tp_degree at the moment, so just expose all cores\n", "def env_setup_fn(*_):\n", " del os.environ[\"NEURON_RT_VISIBLE_CORES\"]\n", "\n", "def benchmark():\n", "\n", " # Create some sample inputs for the decoder\n", " decoder_input_ids = torch.ones((num_beams, 1), dtype=torch.int64)\n", " decoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int32)\n", " encoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int64)\n", " encoder_hidden_states = torch.ones((num_beams, max_length, d_model), dtype=torch.float32)\n", " beam_idx = torch.arange(0, num_beams, dtype=torch.int64)\n", " beam_scores = torch.zeros((num_beams,), dtype=torch.float)\n", "\n", " inputs = (decoder_input_ids,\n", " decoder_attention_mask,\n", " encoder_hidden_states,\n", " encoder_attention_mask,\n", " beam_idx,\n", " beam_scores)\n", "\n", " reports = npf.benchmark(\n", " load_fn,\n", " model_dir,\n", " [inputs], \n", " batch_sizes=1,\n", " n_models=1,\n", " max_infers=decoder_run_count,\n", " workers_per_model=1, # no bottleneck on model inputs, so 1 is fine\n", " env_setup_fn=env_setup_fn,\n", " multiprocess=False,\n", " )\n", " \n", " report = reports[0]\n", "\n", " # let's update throughput to be tokens / second and add a new record\n", " latency_in_s = report[\"latency_ms_avg\"] / 1000\n", " tokens_per_s = decoder_run_count / latency_in_s\n", " report[\"throughput_avg\"] = tokens_per_s\n", " \n", " # display and save results\n", " npf.print_reports(reports, cols=[\"throughput_avg\", \"latency_ms_p50\", \"latency_ms_p99\"])\n", " print(f\"Results saved to: {npf.write_json(reports[0])}\")\n", "\n", "benchmark()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's benchmark inference as a whole, including sampling. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import torch\n", "import neuronx_distributed\n", "import neuronperf as npf\n", "\n", "from transformers import T5Tokenizer\n", "from wrapper import T5Wrapper\n", "\n", "tokenizer = T5Tokenizer.from_pretrained(model_name)\n", "\n", "generated_token_count = 0\n", "\n", "class Wrapper(torch.nn.Module):\n", " def __init__(self, \n", " traced_encoder,\n", " traced_decoder):\n", " super().__init__()\n", " self.model = T5Wrapper.from_pretrained(model_name)\n", " self.model.encoder = traced_encoder\n", " self.model.decoder = traced_decoder\n", " setattr(self.model.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", "\n", " def forward(self, *inputs):\n", " input_ids = inputs[0]['input_ids']\n", " attention_mask = inputs[0]['attention_mask']\n", " return self.model.parallel_infer(input_ids=input_ids,\n", " attention_mask=attention_mask,\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences)\n", "\n", "def load_fn(filename, **kwargs):\n", " traced_encoder = neuronx_distributed.trace.parallel_model_load(filename + \"TracedParallelEncoder.pt\")\n", " traced_decoder = neuronx_distributed.trace.parallel_model_load(filename + \"TracedParallelDecoder.pt\")\n", " return Wrapper(traced_encoder, traced_decoder)\n", "\n", "# NeuronPerf can't see tp_degree at the moment, so just expose all cores\n", "def env_setup_fn(*_):\n", " del os.environ[\"NEURON_RT_VISIBLE_CORES\"]\n", "\n", "def preprocess_fn(inputs):\n", " \n", " encoding = []\n", " for text in inputs:\n", " batch_encoding = tokenizer(text, \n", " max_length=max_length, \n", " truncation=True, \n", " padding='max_length',\n", " return_tensors=\"pt\")\n", " input_ids = batch_encoding['input_ids']\n", " attention_mask = batch_encoding['attention_mask']\n", " encoding.append({\"input_ids\": input_ids,\n", " \"attention_mask\": attention_mask})\n", " return encoding\n", "\n", "def postprocess_fn(outputs):\n", " output = [tokenizer.decode(seq) for seq in outputs]\n", " global generated_token_count \n", " generated_token_count = len(outputs[0])\n", " return output\n", "\n", "def benchmark():\n", " inputs = [\"summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. 
And no one making under $400,000 per year will pay a penny more in taxes.\"]\n", " reports = npf.benchmark(\n", " load_fn,\n", " \"\", # Model dir\n", " [inputs], \n", " batch_sizes=1,\n", " n_models=1,\n", " max_infers=5,\n", " max_duration=0, # sampling can take a while, so let's not timeout\n", " workers_per_model=1, \n", " env_setup_fn=env_setup_fn,\n", " preprocess_fn=preprocess_fn,\n", " postprocess_fn=postprocess_fn,\n", " multiprocess=False,\n", " )\n", " \n", " report = reports[0]\n", "\n", " report[\"throughput_avg\"] = round(generated_token_count / (report[\"latency_ms_avg\"] / 1000), 2)\n", " report[\"latency_per_token_ms_p50\"] = round((report[\"latency_ms_p50\"])/generated_token_count, 2)\n", " report[\"latency_per_token_ms_p99\"] = round((report[\"latency_ms_p99\"])/generated_token_count, 2)\n", "\n", " # display and save results\n", " npf.print_reports(reports, cols=[\"throughput_avg\", \"latency_per_token_ms_p50\", \"latency_per_token_ms_p99\"])\n", " print(f\"Results saved to: {npf.write_json(report)}\")\n", "\n", "benchmark()" ] } ], "metadata": { "kernelspec": { "display_name": "aws_neuron_venv_pytorch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: src/examples/pytorch/neuronx_distributed/t5-inference/t5_model_layers.py ================================================ from neuronx_distributed.parallel_layers import parallel_state from neuronx_distributed.parallel_layers.layers import BaseParallelLinear, ColumnParallelLinear, RowParallelLinear, ParallelEmbedding from neuronx_distributed.parallel_layers.utils import divide import torch from torch import nn from torch.nn.parameter import Parameter from transformers import T5Config from transformers.activations import ACT2FN from transformers.pytorch_utils import find_pruneable_heads_and_indices from transformers.models.t5.modeling_t5 import T5Attention, T5LayerSelfAttention, T5LayerNorm,\ T5LayerCrossAttention, T5LayerFF, T5DenseGatedActDense, T5DenseActDense from transformers import T5ForConditionalGeneration import neuronx_distributed def prune_linear_layer(layer: BaseParallelLinear, index: torch.LongTensor, dim: int = 0) -> BaseParallelLinear: """ Prune a linear layer to keep only entries in index. Used to remove heads. Args: layer (`BaseParallelLinear`): The layer to prune. index (`torch.LongTensor`): The indices to keep in the layer. dim (`int`, *optional*, defaults to 0): The dimension on which to keep the indices. Returns: `BaseParallelLinear`: The pruned layer as a new layer with `requires_grad=True`. 
""" index = index.to(layer.weight.device) W = layer.weight.index_select(dim, index).clone().detach() if layer.bias is not None: if dim == 1: b = layer.bias.clone().detach() else: b = layer.bias[index].clone().detach() new_size = list(layer.weight.size()) new_size[dim] = len(index) new_layer = ColumnParallelLinear(new_size[1], new_size[0], bias=layer.bias is not None, gather_output=False).to(layer.weight.device) new_layer.weight.requires_grad = False new_layer.weight.copy_(W.contiguous()) new_layer.weight.requires_grad = True if layer.bias is not None: new_layer.bias.requires_grad = False new_layer.bias.copy_(b.contiguous()) new_layer.bias.requires_grad = True return new_layer class ParallelAttention(T5Attention): def __init__(self, config: T5Config, has_relative_attention_bias=False): super().__init__(config, has_relative_attention_bias) # Per attention head and per partition values world_size = parallel_state.get_tensor_model_parallel_size() self.num_attention_heads_per_partition = divide( self.n_heads, world_size) self.hidden_size_per_partition = self.num_attention_heads_per_partition * self.key_value_proj_dim # Mesh TensorFlow initialization to avoid scaling before softmax self.q = ColumnParallelLinear(self.d_model, self.inner_dim, bias=False, gather_output=False) self.k = ColumnParallelLinear(self.d_model, self.inner_dim, bias=False, gather_output=False) self.v = ColumnParallelLinear(self.d_model, self.inner_dim, bias=False, gather_output=False) self.o = RowParallelLinear(self.inner_dim, self.d_model, bias=False, input_is_parallel=True) if self.has_relative_attention_bias: self.relative_attention_bias = ParallelEmbedding(self.relative_attention_num_buckets, self.n_heads) self.n_heads = self.num_attention_heads_per_partition def prune_heads(self, heads): if len(heads) == 0: return heads, index = find_pruneable_heads_and_indices( heads, self.num_attention_heads_per_partition, self.key_value_proj_dim, self.pruned_heads ) # Prune linear layers self.q = prune_linear_layer(self.q, index) self.k = prune_linear_layer(self.k, index) self.v = prune_linear_layer(self.v, index) self.o = prune_linear_layer(self.o, index, dim=1) # Update hyper params self.num_attention_heads_per_partition = self.num_attention_heads_per_partition - len(heads) self.hidden_size_per_partition = self.key_value_proj_dim * self.num_attention_heads_per_partition self.pruned_heads = self.pruned_heads.union(heads) def compute_bias(self, query_length, key_length, device=None): """Compute binned relative position bias""" if device is None: device = self.relative_attention_bias.weight.device context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None] memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :] relative_position = memory_position - context_position # shape (query_length, key_length) relative_position_bucket = self._relative_position_bucket( relative_position, # shape (query_length, key_length) bidirectional=(not self.is_decoder), num_buckets=self.relative_attention_num_buckets, max_distance=self.relative_attention_max_distance, ) values = self.relative_attention_bias( relative_position_bucket) tp_rank = parallel_state.get_tensor_model_parallel_rank() values = values[:, :, tp_rank * self.num_attention_heads_per_partition:(tp_rank + 1) * self.num_attention_heads_per_partition] # values = self.relative_attention_bias( # relative_position_bucket) # shape (query_length, key_length, num_heads) values = values.permute([2, 0, 1]).unsqueeze( 0) # shape (1, num_heads, 
query_length, key_length) # print("Values shape is: ", values.shape) return values def forward( self, hidden_states, mask=None, key_value_states=None, position_bias=None, past_key_value=None, layer_head_mask=None, query_length=None, use_cache=False, output_attentions=False, ): """ Self-attention (if key_value_states is None) or attention over source sentence (provided by key_value_states). """ # Input is (batch_size, seq_length, dim) # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length) # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head) self.is_decoder = True batch_size, seq_length = hidden_states.shape[:2] real_seq_length = seq_length if past_key_value is not None: assert ( len(past_key_value) == 2 ), f"past_key_value should have 2 past states: keys and values. Got {len(past_key_value)} past states" real_seq_length += past_key_value[0].shape[2] if query_length is None else query_length key_length = real_seq_length if key_value_states is None else key_value_states.shape[1] def shape(states): """projection""" return states.view(batch_size, -1, self.num_attention_heads_per_partition, self.key_value_proj_dim).transpose(1, 2) def unshape(states): """reshape""" return states.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_size_per_partition) def project(hidden_states, proj_layer, key_value_states, past_key_value): """projects hidden states correctly to key/query states""" if key_value_states is None: # self-attn # (batch_size, n_heads, seq_length, dim_per_head) hidden_states = shape(proj_layer(hidden_states)) elif past_key_value is None: # cross-attn # (batch_size, n_heads, seq_length, dim_per_head) hidden_states = shape(proj_layer(key_value_states)) if past_key_value is not None: # import pdb; pdb.set_trace() if key_value_states is None: # self-attn # (batch_size, n_heads, key_length, dim_per_head) hidden_states = torch.cat([past_key_value, hidden_states], dim=2) elif past_key_value.shape[2] != key_value_states.shape[1]: # checking that the `sequence_length` of the `past_key_value` is the same as # the provided `key_value_states` to support prefix tuning # cross-attn # (batch_size, n_heads, seq_length, dim_per_head) hidden_states = shape(proj_layer(key_value_states)) else: # cross-attn hidden_states = past_key_value return hidden_states # get query states query_states = shape( self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head) # get key/value states key_states = project( hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None ) value_states = project( hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None ) # compute scores scores = torch.matmul( query_states, key_states.transpose(3, 2) ) # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9 if position_bias is None: if not self.has_relative_attention_bias: position_bias = torch.zeros( (1, self.num_attention_heads_per_partition, real_seq_length, key_length), device=scores.device, dtype=scores.dtype ) if self.gradient_checkpointing and self.training: position_bias.requires_grad = True else: position_bias = self.compute_bias(real_seq_length, key_length, device=scores.device) # if key and values are already calculated # we want only the last query position bias if past_key_value is not None: position_bias = position_bias[:, :, -hidden_states.size(1):, :] if mask is not None: print(position_bias.shape, mask.shape, flush=True) 
position_bias = position_bias + mask # (batch_size, n_heads, seq_length, key_length) if self.pruned_heads: mask = torch.ones(position_bias.shape[1]) mask[list(self.pruned_heads)] = 0 position_bias_masked = position_bias[:, mask.bool()] else: position_bias_masked = position_bias # print("Scores is: ", scores.shape) # print("position_bias_masked: ", position_bias_masked.shape) # print(scores.dtype, position_bias_masked.dtype) scores += position_bias_masked attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as( scores ) # (batch_size, n_heads, seq_length, key_length) attn_weights = nn.functional.dropout( attn_weights, p=self.dropout, training=self.training ) # (batch_size, n_heads, seq_length, key_length) # Mask heads if we want to if layer_head_mask is not None: attn_weights = attn_weights * layer_head_mask attn_output = unshape( torch.matmul(attn_weights, value_states)) # (batch_size, seq_length, dim) attn_output = self.o(attn_output) print(self.is_decoder,use_cache, flush=True) present_key_value_state = (key_states, value_states) if ( self.is_decoder and use_cache) else None outputs = (attn_output,) + (present_key_value_state,) + (position_bias,) if output_attentions: outputs = outputs + (attn_weights,) return outputs class ParallelSelfAttention(T5LayerSelfAttention): def __init__(self, config, has_relative_attention_bias=False): super().__init__(config, has_relative_attention_bias=False) self.SelfAttention = ParallelAttention(config, has_relative_attention_bias=has_relative_attention_bias) self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) self.dropout = nn.Dropout(config.dropout_rate) class ParallelCrossAttention(T5LayerCrossAttention): def __init__(self, config): super().__init__(config) self.EncDecAttention = ParallelAttention(config, has_relative_attention_bias=False) self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) self.dropout = nn.Dropout(config.dropout_rate) class ParallelDenseActDense(T5DenseActDense): def __init__(self, config: T5Config): super().__init__(config) self.wi = ColumnParallelLinear(config.d_model, config.d_ff, gather_output=False, bias=False) self.wo = RowParallelLinear(config.d_ff, config.d_model, input_is_parallel=True, bias=False) self.dropout = nn.Dropout(config.dropout_rate) self.act = ACT2FN[config.dense_act_fn] class ParallelDenseGatedActDense(T5DenseGatedActDense): def __init__(self, config: T5Config): super().__init__(config) self.wi_0 = ColumnParallelLinear(config.d_model, config.d_ff, gather_output=False, bias=False) self.wi_1 = ColumnParallelLinear(config.d_model, config.d_ff, gather_output=False, bias=False) self.wo = RowParallelLinear(config.d_ff, config.d_model, input_is_parallel=True, bias=False) self.dropout = nn.Dropout(config.dropout_rate) self.act = ACT2FN[config.dense_act_fn] class ParallelFF(T5LayerFF): def __init__(self, config: T5Config): super().__init__(config) if config.is_gated_act: self.DenseReluDense = ParallelDenseGatedActDense(config) else: self.DenseReluDense = ParallelDenseActDense(config) self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) self.dropout = nn.Dropout(config.dropout_rate) def load_pretrained_with_parallel_attn(model_name): model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto") # Parallel implementation of Attention modules. 
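    # Layout of each T5 decoder block in the Hugging Face implementation,
    # which the loop below relies on:
    #   block.layer[0] -> self-attention  (replaced with ParallelSelfAttention)
    #   block.layer[1] -> cross-attention (replaced with ParallelCrossAttention)
    #   block.layer[2] -> feed-forward    (replaced with ParallelFF)
    # Only block 0 owns the relative attention bias, hence the index == 0 special case.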
from t5_model_layers import ParallelSelfAttention, ParallelFF, ParallelCrossAttention for index, block in enumerate(model.decoder.block): if index == 0: block.layer[0] = ParallelSelfAttention(model.config, has_relative_attention_bias=True) else: block.layer[0] = ParallelSelfAttention(model.config) block.layer[1] = ParallelCrossAttention(model.config) block.layer[2] = ParallelFF(model.config) # Load the weights into the parallel layers neuronx_distributed.parallel_layers.load(model_name.split("/")[-1] + ".pt", model, sharded=False) return model ================================================ FILE: src/examples/pytorch/neuronx_distributed/t5-inference/t5_models.py ================================================ import torch import neuronx_distributed from functools import partial from transformers import T5Tokenizer, T5ForConditionalGeneration from wrapper import EncoderWrapper, DecoderWrapper from t5_model_layers import load_pretrained_with_parallel_attn def get_wrapped_encoder(max_length, num_beams, tp_degree, model_name): model = load_pretrained_with_parallel_attn(model_name) encoder = EncoderWrapper(model.encoder, model.decoder, model.config, num_beams, max_length, "xla", num_beams, tp_degree=tp_degree) encoder.eval() # We are aliasing the cache so that the cache always stays on device. aliases = {} for i in range(len(encoder.past_key_values_sa)): aliases[encoder.past_key_values_sa[i]] = i for i in range(len(encoder.past_key_values_ca)): aliases[encoder.past_key_values_ca[i]] = len(encoder.past_key_values_sa) + i return encoder, aliases def get_wrapped_decoder(max_length, num_beams, tp_degree, model_name): model = load_pretrained_with_parallel_attn(model_name) decoder = DecoderWrapper(decoder=model.decoder, lm_head=model.lm_head, model_config=model.config, num_beams=num_beams, max_length=max_length, device="xla", tp_degree=tp_degree) decoder.eval() num_outputs_from_trace = 3 if num_beams > 1 else 1 aliases = {} for i in range(len(decoder.past_key_values_sa)): aliases[decoder.past_key_values_sa[i]] = i + num_outputs_from_trace for i in range(len(decoder.past_key_values_ca)): aliases[decoder.past_key_values_ca[i]] = len(decoder.past_key_values_sa) + i + num_outputs_from_trace return decoder, aliases def parallel_trace_encoder(model_name: str, max_length: int, num_beams: int, tp_degree: int): print("starting encoder parallel trace") tokenizer = T5Tokenizer.from_pretrained(model_name) get_encoder_callable = partial(get_wrapped_encoder, max_length, num_beams, tp_degree, model_name) # Trace encoder batch_encoding = tokenizer("translate English to German: Lets go home now", max_length=max_length, truncation=True, padding='max_length', return_tensors="pt") input_ids = batch_encoding['input_ids'] attention_mask = batch_encoding['attention_mask'] # Here we are tracing the encoder and cache together. The cache is marked as state and we are aliasing.
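    # How the alias map is consumed (a sketch based on get_wrapped_encoder above):
    #   aliases[cache_tensor] = i  marks output i of the traced graph as writing
    #   back into that pre-allocated cache tensor, so KV-cache updates stay on
    #   device between decoding steps instead of round-tripping through the host.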
traced_encoder = neuronx_distributed.trace.parallel_model_trace(get_encoder_callable, ( input_ids, attention_mask, ), tp_degree=tp_degree, compiler_workdir="/tmp/encoder/", ) setattr(traced_encoder, 'main_input_name', 'input_ids') # Attribute required by beam search print("completed encoder parallel trace") return traced_encoder def parallel_trace_decoder(model: T5ForConditionalGeneration, model_name: str, num_beams: int, max_length: int, tp_degree: int): print("starting decoder trace") get_decoder_callable = partial(get_wrapped_decoder, max_length, num_beams, tp_degree, model_name) # We create mock inputs so we can trace the decoder decoder_input_ids = torch.ones((num_beams, 1), dtype=torch.int64) decoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int32) encoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int64) encoder_hidden_states = torch.ones((num_beams, max_length, model.config.d_model), dtype=torch.float32) beam_idx = torch.arange(0, num_beams, dtype=torch.int64) beam_scores = torch.zeros((num_beams,), dtype=torch.float) traced_decoder = neuronx_distributed.trace.parallel_model_trace(get_decoder_callable, ( decoder_input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask, beam_idx, beam_scores ), tp_degree=tp_degree, compiler_workdir="/tmp/decoder/", ) print("completed decoder trace") return traced_decoder ================================================ FILE: src/examples/pytorch/neuronx_distributed/t5-inference/wrapper.py ================================================ import torch import neuronx_distributed import torch_xla.core.xla_model as xm from transformers import T5Tokenizer, T5ForConditionalGeneration from transformers.modeling_outputs import BaseModelOutput, Seq2SeqLMOutput from transformers.models.t5.modeling_t5 import T5Stack, T5LayerCrossAttention from transformers.generation.utils import ModelOutput from typing import Any, Dict, List, Optional, Tuple, Union from transformers.generation.beam_search import BeamScorer, BeamSearchScorer from optimum.neuron.generation import NeuronGenerationMixin from transformers.generation.logits_process import ( LogitsProcessorList, ) from transformers.generation.stopping_criteria import ( MaxLengthCriteria, MaxTimeCriteria, StoppingCriteriaList, validate_stopping_criteria, ) from transformers.generation.utils import ( BeamSearchDecoderOnlyOutput, BeamSearchEncoderDecoderOutput, BeamSearchOutput, GreedySearchOutput, ) class T5Wrapper(T5ForConditionalGeneration, NeuronGenerationMixin): def _prepare_encoder_decoder_kwargs_for_generation( self, inputs_tensor: torch.Tensor, model_kwargs, model_input_name: Optional[str] = None ) -> Dict[str, Any]: encoder = self.get_encoder() model_kwargs["encoder_outputs"]: ModelOutput = encoder(inputs_tensor, model_kwargs["attention_mask"]) return model_kwargs # Override to cut the input_ids to just the last token def prepare_inputs_for_generation( self, input_ids, past_key_values=None, attention_mask=None, head_mask=None, decoder_head_mask=None, decoder_attention_mask=None, cross_attn_head_mask=None, use_cache=None, encoder_outputs=None, **kwargs, ): # cut decoder_input_ids as past is cached input_ids = input_ids[:, -1:] return { "decoder_input_ids": input_ids, "past_key_values": past_key_values, "encoder_outputs": encoder_outputs, "attention_mask": attention_mask, "head_mask": head_mask, "decoder_head_mask": decoder_head_mask, "decoder_attention_mask": decoder_attention_mask, "cross_attn_head_mask": cross_attn_head_mask, "use_cache":
use_cache, } ''' We update the cache in the decoder trace, so let's override the _update_model_kwargs_for_xla_generation in NeuronGenerationMixin ''' def _update_model_kwargs_for_xla_generation( self, model_kwargs: Dict[str, Any], batch_size: int, is_encoder_decoder: bool = False, standardize_cache_format: bool = False, max_length: Optional[int] = None, seq_length: Optional[int] = None, use_cache: bool = True, ) -> Dict[str, Any]: def _update_attention(model_kwargs, is_encoder_decoder): """Updates the appropriate attention mask -- encoder-decoder models use `decoder_attention_mask`""" attention_mask_name = "decoder_attention_mask" if is_encoder_decoder else "attention_mask" attention_mask = model_kwargs.pop(attention_mask_name) attention_mask_update_slice = torch.ones( (batch_size, 1), dtype=attention_mask.dtype, device=attention_mask.device ) attention_mask = torch.cat([attention_mask[:, 1:], attention_mask_update_slice], dim=-1) mask = {attention_mask_name: attention_mask} return mask mask = _update_attention(model_kwargs, is_encoder_decoder) # sets the updated variables (mask and past_key_values) model_kwargs.update(mask) # Set a mock cache tensor model_kwargs["past_key_values"] = torch.tensor([]) return model_kwargs def _reorder_cache(self, past_key_values, beam_idx): ''' This is needed for beam search and not greedy sampling. We reorder the cache within the trace so we can skip it in modeling_t5.py; hence this override of _reorder_cache ''' self.beam_idx = beam_idx return past_key_values def infer(self, tokenizer: T5Tokenizer, prompt: str, max_length: int, num_beams: int, num_return_sequences: int, device: str): batch_encoding = tokenizer(prompt, max_length=max_length, truncation=True, padding='max_length', return_tensors="pt") past_key_values = self.encoder(batch_encoding['input_ids'],batch_encoding['attention_mask']) decoder_attention_mask = torch.cat([torch.zeros((1, max_length-1), dtype=torch.int32), torch.ones((1, 1), dtype=torch.int32)], axis=1) # copy the new cache state to the decoder if device == "xla": for state, tensor in zip(self.decoder.parameters(), past_key_values): state.copy_(tensor) else: # First half of the cache is self attention and the rest is cross attention self.decoder.past_key_values_sa = past_key_values[:len(past_key_values)//2] self.decoder.past_key_values_ca = past_key_values[len(past_key_values)//2:] output = self.generate(**batch_encoding, max_length=max_length, num_beams=num_beams, num_return_sequences=num_return_sequences, do_sample=False, use_cache=True, decoder_attention_mask=decoder_attention_mask, encoder_outputs={"last_hidden_state": torch.ones((1,128,1))}) # Pass fake encoder_outputs so the transformers code will not invoke the encoder return output def parallel_infer(self, max_length: int, num_beams: int, num_return_sequences: int, device: str = None, tokenizer: T5Tokenizer = None, prompt: str = None, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None): if input_ids is None or attention_mask is None: batch_encoding = tokenizer(prompt, max_length=max_length, truncation=True, padding='max_length', return_tensors="pt") else: batch_encoding = { 'input_ids' : input_ids, 'attention_mask': attention_mask } past_key_values = self.encoder(batch_encoding['input_ids'],batch_encoding['attention_mask']) decoder_attention_mask = torch.cat([torch.zeros((1, max_length-1), dtype=torch.int32), torch.ones((1, 1), dtype=torch.int32)], axis=1) # Here the encoder now returns the cache as device tensors, so we directly assign # the cache device tensor
to the decoder's cache (which is also a device tensor). # We thereby avoid the copy and always use pre-allocated memory. for model_tp_decoder, model_tp_encoder in zip(self.decoder.models, self.encoder.models): model_tp_decoder.load_state_dict(model_tp_encoder.state_dict(), strict=True) # Pass fake encoder_outputs so the transformers code will not invoke the encoder output = self.generate(**batch_encoding, max_length=max_length, num_beams=num_beams, num_return_sequences=num_return_sequences, do_sample=False, use_cache=True, decoder_attention_mask=decoder_attention_mask, encoder_outputs={"last_hidden_state": torch.ones((1,128,1))}) return output def forward( self, attention_mask: Optional[torch.FloatTensor] = None, decoder_input_ids: Optional[torch.LongTensor] = None, decoder_attention_mask: Optional[torch.BoolTensor] = None, encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None, beam_scores = None, **kwargs ) -> Union[Tuple[torch.FloatTensor], Seq2SeqLMOutput]: hidden_states = encoder_outputs["last_hidden_state"] if not hasattr(self, 'beam_idx'): # Inferring the number of beams from the attention mask num_beams = attention_mask.shape[0] self.beam_idx = torch.arange(0, num_beams, dtype=torch.int64) decoder_outputs = self.decoder( decoder_input_ids, decoder_attention_mask, hidden_states, attention_mask, self.beam_idx, beam_scores ) # lm_logits = decoder_outputs[0] next_token_scores = decoder_outputs[0] next_tokens = decoder_outputs[1] next_indices = decoder_outputs[2] return next_token_scores, next_tokens, next_indices def beam_search( self, input_ids: torch.LongTensor, beam_scorer: BeamScorer, logits_processor: Optional[LogitsProcessorList] = None, stopping_criteria: Optional[StoppingCriteriaList] = None, max_length: Optional[int] = None, pad_token_id: Optional[int] = None, eos_token_id: Optional[Union[int, List[int]]] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, output_scores: Optional[bool] = None, return_dict_in_generate: Optional[bool] = None, synced_gpus: Optional[bool] = False, seq_length: Optional[int] = None, **model_kwargs, ) -> Union[BeamSearchOutput, torch.LongTensor]: logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList() stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id if isinstance(eos_token_id, int): eos_token_id = [eos_token_id] output_scores = output_scores if output_scores is not None else self.generation_config.output_scores output_attentions = ( output_attentions if output_attentions is not None else self.generation_config.output_attentions ) output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states ) batch_size = len(beam_scorer._beam_hyps) num_beams = beam_scorer.num_beams batch_beam_size, cur_len = input_ids.shape # Overwrite cur_len cur_len = seq_length if num_beams * batch_size != batch_beam_size: raise ValueError( f"Batch dimension of `input_ids` should be {num_beams * batch_size}, but is {batch_beam_size}."
) # init attention / hidden states / scores tuples scores = () if (return_dict_in_generate and output_scores) else None beam_indices = ( tuple(() for _ in range(batch_beam_size)) if (return_dict_in_generate and output_scores) else None ) # initialise score of first beam with 0 and the rest with -1e9. This makes sure that only tokens # of the first beam are considered to avoid sampling the exact same tokens across all beams. # beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device) beam_scores_device = "cpu" beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=beam_scores_device) beam_scores[:, 1:] = -1e9 beam_scores = beam_scores.view((batch_size * num_beams,)) while True: # prepare model inputs # From max_length-sized input_ids, select first # cur_len - 1 values. update_indices = torch.stack( [torch.arange(input_ids.size(0)), torch.tensor(cur_len - 1).repeat(input_ids.size(0))], dim=-1 ) input_ids_ = input_ids[update_indices[:, 0], update_indices[:, 1], None] model_inputs = self.prepare_inputs_for_generation(input_ids_, **model_kwargs) next_token_scores, next_tokens, next_indices = self( **model_inputs, return_dict=True, output_attentions=output_attentions, output_hidden_states=output_hidden_states, beam_scores=beam_scores ) # stateless beam_outputs = beam_scorer.process( input_ids.to("cpu")[:, :cur_len], next_token_scores.to("cpu"), next_tokens.to("cpu"), next_indices.to("cpu"), pad_token_id=pad_token_id, eos_token_id=eos_token_id, beam_indices=beam_indices, ) beam_scores = beam_outputs["next_beam_scores"] beam_next_tokens = beam_outputs["next_beam_tokens"] beam_idx = beam_outputs["next_beam_indices"] update_indices = torch.stack( [torch.arange(batch_beam_size), torch.tensor(cur_len - 1).repeat(batch_beam_size)], dim=-1 ) update_indices_2 = torch.stack( [torch.arange(batch_beam_size), torch.tensor(cur_len).repeat(batch_beam_size)], dim=-1 ) # First select beam_indices device = input_ids.device beam_idx_device = beam_idx.to(device=input_ids.device) input_ids[:, :] = input_ids[beam_idx_device.long(), :] # Then append new tokens input_ids[update_indices_2[:, 0], update_indices_2[:, 1], None] = beam_next_tokens.unsqueeze(-1).to(device).to(torch.long) input_ids = input_ids * 1 # Hack to materialize tensor # update generated ids, model inputs, and length for next step model_kwargs = self._update_model_kwargs_for_xla_generation( model_kwargs, batch_size=batch_beam_size, is_encoder_decoder=self.config.is_encoder_decoder, max_length=stopping_criteria.max_length, seq_length=cur_len, use_cache=model_kwargs["use_cache"], ) if model_kwargs["past_key_values"] is not None: model_kwargs["past_key_values"] = self._reorder_cache(model_kwargs["past_key_values"], beam_idx.to(torch.int64)) if return_dict_in_generate and output_scores: beam_indices = tuple((beam_indices[beam_idx[i]] + (beam_idx[i],) for i in range(len(beam_indices)))) # increase cur_len cur_len = cur_len + 1 # stop when each sentence is finished, or if we exceed the maximum length stop_criterion_1 = beam_scorer.is_done if isinstance(stopping_criteria, list): if len(stopping_criteria) == 1: stopping_criteria = stopping_criteria[0] # Cases that can be handled in XLA without requiring # non-padded input_ids if isinstance(stopping_criteria, MaxLengthCriteria): stop_criterion_2 = cur_len >= stopping_criteria.max_length elif isinstance(stopping_criteria, MaxTimeCriteria): stop_criterion_2 = stopping_criteria(input_ids, scores) else: # Other cases will be handled on CPU batch_size, _ = 
input_ids.shape input_ids_cpu = input_ids.to("cpu") mask = torch.cat( [torch.ones(batch_size, cur_len), torch.zeros(batch_size, input_ids.shape[1] - cur_len)], dim=1 ).bool() input_ids_cpu = torch.masked_select(input_ids_cpu, mask).reshape((batch_size, cur_len)) scores_cpu = scores.to("cpu") if torch.is_tensor(scores) else scores stop_criterion_2 = stopping_criteria(input_ids_cpu, scores_cpu) if stop_criterion_1 or stop_criterion_2: if not synced_gpus: break else: this_peer_finished = True sequence_outputs = beam_scorer.finalize( input_ids.to("cpu"), beam_scores.to("cpu"), next_tokens.to("cpu"), next_indices.to("cpu"), pad_token_id=pad_token_id, eos_token_id=eos_token_id, max_length=stopping_criteria.max_length, beam_indices=beam_indices, ) for k, v in sequence_outputs.items(): if type(v) == torch.Tensor: sequence_outputs[k] = sequence_outputs[k].to(input_ids.device) return sequence_outputs["sequences"] def greedy_search( self, input_ids: torch.LongTensor, logits_processor: Optional[LogitsProcessorList] = None, stopping_criteria: Optional[StoppingCriteriaList] = None, max_length: Optional[int] = None, pad_token_id: Optional[int] = None, eos_token_id: Optional[Union[int, List[int]]] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, output_scores: Optional[bool] = None, return_dict_in_generate: Optional[bool] = None, seq_length: Optional[int] = None, streamer: Optional["BaseStreamer"] = None, **model_kwargs, ) -> Union[GreedySearchOutput, torch.LongTensor]: """ Overriding greedy sampling to use next tokens returned from the Neuron device instead of logits. """ # init values logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList() use_cache = model_kwargs["use_cache"] if "use_cache" in model_kwargs else False stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id if isinstance(eos_token_id, int): eos_token_id = [eos_token_id] eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None output_scores = output_scores if output_scores is not None else self.generation_config.output_scores output_attentions = ( output_attentions if output_attentions is not None else self.generation_config.output_attentions ) output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states ) # init attention / hidden states / scores tuples scores = () if (return_dict_in_generate and output_scores) else None decoder_attentions = () if (return_dict_in_generate and output_attentions) else None cross_attentions = () if (return_dict_in_generate and output_attentions) else None decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None # keep track of which sequences are already finished unfinished_sequences = torch.ones(input_ids.shape[0], dtype=torch.long, device=input_ids.device) this_peer_finished = False # used by synced_gpus only while True: # prepare model inputs # From max_length-sized input_ids, select first # seq_length - 1 values.
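# Worked example of the fixed-shape update used below (an illustration, not part of the original flow):
# with max_length=8 and seq_length=3, input_ids keeps its [batch, 8] shape for the whole generation
# loop; on this step the model consumes the token at column 2 and the newly generated token is
# written into column 3, so tensor shapes never change and the compiled XLA graph is reused at
# every step instead of being recompiled for each new sequence length.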
if model_kwargs.get("past_key_values") is None: input_ids_ = input_ids[:, :seq_length] else: update_indices = torch.stack( [torch.arange(input_ids.size(0)), torch.tensor(seq_length - 1).repeat(input_ids.size(0))], dim=-1, ) input_ids_ = input_ids[update_indices[:, 0], update_indices[:, 1], None] model_inputs = self.prepare_inputs_for_generation(input_ids_, **model_kwargs) # forward pass to get next token output = self( **model_inputs, return_dict=True, output_attentions=output_attentions, output_hidden_states=output_hidden_states, ) next_tokens = output[0] # finished sentences should have their next token be a padding token if eos_token_id is not None: if pad_token_id is None: raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.") next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) # update generated ids, model inputs, and length for next step batch_size, _ = input_ids.shape update_indices = torch.stack( [torch.arange(batch_size), torch.tensor(seq_length).repeat(batch_size)], dim=-1 ) input_ids[update_indices[:, 0], update_indices[:, 1]] = next_tokens[:] model_kwargs = self._update_model_kwargs_for_xla_generation( model_kwargs, batch_size=batch_size, is_encoder_decoder=self.config.is_encoder_decoder, max_length=stopping_criteria.max_length, seq_length=seq_length, use_cache=use_cache, ) seq_length += 1 # if eos_token was found in one sentence, set sentence to finished if eos_token_id_tensor is not None: unfinished_sequences = unfinished_sequences.mul( next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0) ) # stop when each sentence is finished, or if we exceed the maximum length stop_criterion_1 = unfinished_sequences.max() == 0 if isinstance(stopping_criteria, list): if len(stopping_criteria) == 1: stopping_criteria = stopping_criteria[0] # Cases that can be handled in XLA without requiring # non-padded input_ids if isinstance(stopping_criteria, MaxLengthCriteria): stop_criterion_2 = seq_length >= stopping_criteria.max_length elif isinstance(stopping_criteria, MaxTimeCriteria): stop_criterion_2 = stopping_criteria(input_ids, scores) else: # Other cases will be handled on CPU batch_size, _ = input_ids.shape mask = torch.cat( [torch.ones(batch_size, seq_length), torch.zeros(batch_size, input_ids.shape[1] - seq_length)], dim=1, ).bool() input_ids_cpu = torch.masked_select(input_ids, mask).reshape((batch_size, seq_length)).to("cpu") scores_cpu = scores.to("cpu") if torch.is_tensor(scores) else scores stop_criterion_2 = stopping_criteria(input_ids_cpu, scores_cpu) if stop_criterion_1 or stop_criterion_2: this_peer_finished = True if this_peer_finished: break if streamer is not None: streamer.end() return input_ids class EncoderWrapper(torch.nn.Module): ''' This wrapper converts positional args to kwargs ''' def __init__(self, encoder, decoder, model_config, batch_size, max_length, device, num_beams, tp_degree=None): super().__init__() self.encoder = encoder self.decoder = decoder self.batch_size = batch_size self.max_length = max_length self.model_config = model_config self.device = device self.num_beams = num_beams self.num_attention_heads_per_partition = model_config.num_heads self.tp_degree = tp_degree if self.tp_degree is not None: self.num_attention_heads_per_partition = model_config.num_heads // neuronx_distributed.parallel_layers.parallel_state.get_tensor_model_parallel_size() self.past_key_values_sa = 
torch.nn.ParameterList([torch.nn.Parameter(torch.ones((self.num_beams,self.num_attention_heads_per_partition,self.max_length-1,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(model_config.num_decoder_layers * 2)]) self.past_key_values_ca = torch.nn.ParameterList([torch.nn.Parameter(torch.ones((self.num_beams,self.num_attention_heads_per_partition,self.max_length,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(model_config.num_decoder_layers * 2)]) def forward(self, input_ids, attention_mask): ''' This is the core functionality we want to trace. ''' encoder_output = self.encoder(input_ids=input_ids, attention_mask=attention_mask, output_attentions=False, output_hidden_states=False) last_hidden_state = encoder_output["last_hidden_state"] encoder_hidden_states = torch.concat([tensor.unsqueeze(0).repeat(self.num_beams, 1, 1) for tensor in last_hidden_state]) decoder_blocks = self.decoder.block present_key_value_states_sa = [] present_key_value_states_ca = [] for i, block in enumerate(decoder_blocks): # Cross attention has to be initialized with the encoder hidden state cross_attention: T5LayerCrossAttention = block.layer[1] attention = cross_attention.EncDecAttention def shape(states): """projection""" return states.view(self.batch_size, -1, self.num_attention_heads_per_partition, attention.key_value_proj_dim).transpose(1, 2) key_states = shape(attention.k(encoder_hidden_states)) value_states = shape(attention.v(encoder_hidden_states)) if self.tp_degree is None: # cross_attn_kv_state present_key_value_states_ca.append(key_states) present_key_value_states_ca.append(value_states) # Self attention kv states are initialized to zeros. present_key_value_states_sa.append(torch.zeros((self.batch_size, # key states self.model_config.num_heads, self.max_length-1, self.model_config.d_kv), dtype=torch.float32, device=self.device)) present_key_value_states_sa.append(torch.zeros((self.batch_size, # value states self.model_config.num_heads, self.max_length-1, self.model_config.d_kv), dtype=torch.float32, device=self.device)) else: # We want to copy the cross attention states (key_states and value_states) into the decoder trace. # One way of doing it is to get the encoder trace to return the kv states as an output and then we can pass it to the decoder trace # as an input. But this requires a copy from device to CPU and back. # # There is no good way to keep the output within the device yet. Until we build that feature, we use this workaround. # The workaround uses input_output_aliasing to map the output kv state to an input parameter. The output present_key_value_states_ca # represents the cross attention kv states and is aliased to a similarly named parameter. # # Why are we multiplying past_key_values_ca with 0 and adding it to the key or value state? # The trace API will remove any variables that are not used to compute the output tensor. As the past_key_values parameter is not # being used to compute the kv cache, it would be removed. To avoid that, we use it in an operation that computes the output # but at the same time does not affect the output.
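# To make the aliasing trick concrete: if `alias_param` is one of the pre-allocated cache
# parameters and `fresh` is the newly computed key or value tensor, then
#     out = (alias_param * 0) + fresh
# is numerically identical to `fresh`, but it keeps `alias_param` alive in the traced graph,
# so the output can be aliased onto the parameter's device buffer instead of the parameter
# being pruned away during tracing.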
present_key_value_states_ca.append((self.past_key_values_ca[i*2] * 0) + key_states) present_key_value_states_ca.append((self.past_key_values_ca[i*2+1] * 0) + value_states) present_key_value_states_sa.append(self.past_key_values_sa[i*2]*torch.zeros((self.batch_size, self.num_attention_heads_per_partition, self.max_length-1, self.model_config.d_kv), dtype=torch.float32, device="xla")) present_key_value_states_sa.append(self.past_key_values_sa[i*2+1]*torch.zeros((self.batch_size, self.num_attention_heads_per_partition, self.max_length-1, self.model_config.d_kv), dtype=torch.float32, device="xla")) return present_key_value_states_sa + present_key_value_states_ca class DecoderWrapper(torch.nn.Module): def __init__(self, decoder: T5Stack, lm_head: torch.nn.Linear, model_config, num_beams: int, max_length: int, device: str, tp_degree=None): super().__init__() self.decoder = decoder self.lm_head = lm_head self.model_dim=model_config.d_model self.device = device self.num_beams = num_beams self.batch_size = 1 self.config = model_config num_heads=model_config.num_heads num_decoder_layers=model_config.num_decoder_layers self.num_attention_heads_per_partition = num_heads if tp_degree is not None: self.num_attention_heads_per_partition = num_heads // neuronx_distributed.parallel_layers.parallel_state.get_tensor_model_parallel_size() # (num_beams, n_heads, seq_length, dim_per_head) if device == "cpu": self.past_key_values_sa = [torch.ones((num_beams,num_heads,max_length-1,model_config.d_kv), dtype=torch.float32) for _ in range(num_decoder_layers * 2)] self.past_key_values_ca = [torch.ones((num_beams,num_heads,max_length,model_config.d_kv), dtype=torch.float32) for _ in range(num_decoder_layers * 2)] elif device == "xla": self.past_key_values_sa = torch.nn.ParameterList([torch.nn.Parameter(torch.ones((num_beams,self.num_attention_heads_per_partition,max_length-1,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(num_decoder_layers * 2)]) self.past_key_values_ca = torch.nn.ParameterList([torch.nn.Parameter(torch.ones((num_beams,self.num_attention_heads_per_partition,max_length,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(num_decoder_layers * 2)]) def update_past(self, past_key_values): new_past_sa = [] new_past_ca = [] for past_layer in past_key_values: new_past_layer = list(past_layer) for i in range(len(new_past_layer[:2])): new_past_layer[i] = past_layer[i][:, :, 1:] new_past_sa += [new_past_layer[:2],] new_past_ca += [new_past_layer[2:],] return new_past_sa, new_past_ca def reorder_cache(self, past_key_values, beam_idx): for i in range(len(past_key_values)): past_key_values[i] = torch.index_select(past_key_values[i], 0, beam_idx) return past_key_values def forward(self, input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask, beam_idx, beam_scores, **kwargs): if self.num_beams > 1: # We reorder the cache based on the beams selected in each iteration. Required step for beam search. past_key_values_sa = self.reorder_cache(self.past_key_values_sa, beam_idx) past_key_values_ca = self.reorder_cache(self.past_key_values_ca, beam_idx) else: # We do not need to reorder for greedy sampling past_key_values_sa = self.past_key_values_sa past_key_values_ca = self.past_key_values_ca # The cache is stored in a flattened form. We order the cache per layer before passing it to the decoder. # Each layer has 4 tensors, so we group by 4.
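# For example, with 2 decoder layers the flattened caches
#     past_key_values_sa = [k0_sa, v0_sa, k1_sa, v1_sa]
#     past_key_values_ca = [k0_ca, v0_ca, k1_ca, v1_ca]
# are regrouped into [[k0_sa, v0_sa, k0_ca, v0_ca], [k1_sa, v1_sa, k1_ca, v1_ca]].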
past_key_values = [[*past_key_values_sa[i*2:i*2+2], *past_key_values_ca[i*2:i*2+2]] for i in range(0, int(len(past_key_values_ca)/2))] decoder_output = self.decoder( input_ids=input_ids, attention_mask=decoder_attention_mask, past_key_values=past_key_values, encoder_hidden_states=encoder_hidden_states, encoder_attention_mask=encoder_attention_mask, use_cache=True, output_attentions=False, output_hidden_states=False) last_hidden_state = decoder_output['last_hidden_state'] past_key_values = decoder_output['past_key_values'] if self.config.tie_word_embeddings: last_hidden_state = last_hidden_state * (self.model_dim**-0.5) lm_logits = self.lm_head(last_hidden_state) past_key_values_sa, past_key_values_ca = self.update_past(past_key_values) # We flatten the cache to a single array. This is required for the input-output aliasing to work past_key_values_sa = [vec for kv_per_layer in past_key_values_sa for vec in kv_per_layer] past_key_values_ca = [vec for kv_per_layer in past_key_values_ca for vec in kv_per_layer] if self.device == "cpu": self.past_key_values_sa = past_key_values_sa self.past_key_values_ca = past_key_values_ca # Moving the top-k computation inside the traced decoder next_token_logits = lm_logits[:, -1, :] if self.num_beams > 1: logit_max, _ = torch.max(next_token_logits, dim=-1, keepdim=True) logsumexp = torch.log(torch.exp(next_token_logits - logit_max).sum(dim=-1, keepdim=True)) next_token_scores = next_token_logits - logit_max - logsumexp next_token_scores = next_token_scores + beam_scores[:, None].expand_as(next_token_scores) # reshape for beam search vocab_size = next_token_scores.shape[-1] next_token_scores = next_token_scores.view(self.batch_size, self.num_beams * vocab_size) next_token_scores = next_token_scores * 1 # Sample 2 next tokens for each beam (so we have some spare tokens and match output of beam search) next_token_scores, next_tokens = torch.topk( next_token_scores, 2 * self.num_beams, dim=1, largest=True, sorted=True ) next_indices = torch.div(next_tokens, vocab_size, rounding_mode="floor") next_tokens = next_tokens % vocab_size return [next_token_scores, next_tokens, next_indices] + past_key_values_sa + past_key_values_ca else: # Greedy next_tokens = torch.argmax(next_token_logits, dim=-1) return [next_tokens] + past_key_values_sa + past_key_values_ca ================================================ FILE: src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "variable-character", "metadata": {}, "source": [ "# Using NeuronCore Pipeline with PyTorch" ] }, { "cell_type": "markdown", "id": "valued-economics", "metadata": {}, "source": [ "In this tutorial you compile a pretrained BERT base model from HuggingFace 🤗 Transformers, using the NeuronCore Pipeline feature of the AWS Neuron SDK. You benchmark model latency of the pipeline parallel mode and compare it with the usual data parallel (multi-worker) deployment.\n", "\n", "This tutorial is intended to run on an inf1.6xlarge instance running the latest AWS Deep Learning AMI (DLAMI). The inf1.6xlarge instance size has four AWS Inferentia chips, for a total of 16 NeuronCores.\n", "\n", "Verify that this Jupyter notebook is running the Python or Conda kernel environment that was set up according to the [PyTorch Installation Guide](../../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). 
You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page.\n", "\n", "> __Note:__ Do not execute this tutorial using \"Run -> Run all cells\" option. " ] }, { "cell_type": "markdown", "id": "private-authentication", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `neuron-cc[tensorflow]`\n", "- `transformers`\n", "\n", "Most of these packages will be installed when configuring your environment using the Neuron PyTorch setup guide. The additional HuggingFace 🤗 Transformers dependency must be installed here." ] }, { "cell_type": "code", "execution_count": null, "id": "romantic-accident", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings, making errors easier to detect\n", "!pip install --upgrade \"transformers==4.6.0\"" ] }, { "cell_type": "markdown", "id": "prompt-australian", "metadata": {}, "source": [ "## Compiling a BERT base model for a single NeuronCore" ] }, { "cell_type": "markdown", "id": "aging-biodiversity", "metadata": {}, "source": [ "To run a HuggingFace [BERTModel](https://huggingface.co/transformers/model_doc/bert.html#bertmodel) on Inferentia, you only need to add a single extra line of code to the usual 🤗 Transformers PyTorch implementation, after importing the torch_neuron framework. \n", "\n", "Add the argument `return_dict=False` to the BERT transformers model so it can be traced with [TorchScript](https://pytorch.org/docs/stable/jit.html). TorchScript is a way to create serializable and optimizable models from PyTorch code. \n", "\n", "Enable padding to a maximum sequence length of 128, to test the model's performance with a realistic payload size. You can adapt this sequence length to your application's requirement. \n", "\n", "You can adapt the original example on the [BertModel forward pass docstring](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.forward) according to the following cell.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "stretch-preview", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "from transformers import BertTokenizer, BertModel\n", "\n", "from joblib import Parallel, delayed \n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "import os\n", "import time \n", "\n", "\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", "model = BertModel.from_pretrained('bert-base-uncased',return_dict=False)\n", "\n", "inputs = tokenizer(\"Hello, my dog is cute\",return_tensors=\"pt\",max_length=128,padding='max_length',truncation=True)\n" ] }, { "cell_type": "markdown", "id": "conceptual-aberdeen", "metadata": {}, "source": [ "The one extra line required is the call to the `torch.neuron.trace()` method. This call compiles the model and returns the forward method of the torch `nn.Module`, which you can use to run inference. \n", "\n", "The compiled graph can be saved using the `torch.jit.save` function and restored using the `torch.jit.load` function for inference on Inf1 instances. 
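As a minimal sketch (reusing the `neuron_model` object created in the next cell), saving and restoring would look like:\n", "\n", "```python\n", "neuron_model.save('bert-base-uncased-neuron.pt')\n", "restored_model = torch.jit.load('bert-base-uncased-neuron.pt')\n", "```\n", "\n", "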
During inference, the previously compiled artifacts will be loaded into the Neuron Runtime for inference execution.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "secondary-exclusive", "metadata": {}, "outputs": [], "source": [ "neuron_model = torch.neuron.trace(model, \n", " example_inputs = (inputs['input_ids'],inputs['attention_mask']),\n", " verbose=1)\n" ] }, { "cell_type": "markdown", "id": "atmospheric-stewart", "metadata": {}, "source": [ "## Running the BERT base model on a single NeuronCore\n", "With the model already available in memory, you can time one execution and check for the latency on the single inference call. You will load the model into Inferentia with a single inference call. A large \"wall time\" is expected when you first run the next cell; running the cell twice will show the actual inference latency:" ] }, { "cell_type": "code", "execution_count": null, "id": "approved-reputation", "metadata": {}, "outputs": [], "source": [ "%%time\n", "# The following line tests inference and should be executed on Inf1 instance family. \n", "outputs = neuron_model(*(inputs['input_ids'],inputs['attention_mask']))" ] }, { "cell_type": "markdown", "id": "great-collective", "metadata": {}, "source": [ "You can also check for the throughput of the single model running on a single NeuronCore.\n", "\n", "The sequential inference test (for loop) does not measure all the performance one can achieve in an instance with multiple NeuronCores. To improve hardware utilization you can run parallel inference requests over multiple model workers, which you'll test in the Data Parallel Bonus Section below." ] }, { "cell_type": "code", "execution_count": null, "id": "framed-reference", "metadata": {}, "outputs": [], "source": [ "%%time\n", "for _ in tqdm(range(100)):\n", " outputs = neuron_model(*(inputs['input_ids'],inputs['attention_mask'])) " ] }, { "cell_type": "markdown", "id": "super-innocent", "metadata": {}, "source": [ "Save the compiled model for later use:" ] }, { "cell_type": "code", "execution_count": null, "id": "express-greensboro", "metadata": {}, "outputs": [], "source": [ "neuron_model.save('bert-base-uncased-neuron.pt')" ] }, { "cell_type": "markdown", "id": "modified-government", "metadata": {}, "source": [ "## Compiling a BERT base model for 16 NeuronCores\n", "\n", "Our next step is to compile the same model for all 16 NeuronCores available in the inf1.6xlarge and check the performance difference when running pipeline parallel inferences. " ] }, { "cell_type": "code", "execution_count": null, "id": "compound-initial", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "from transformers import BertTokenizer, BertModel\n", "\n", "from joblib import Parallel, delayed \n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "import os\n", "import time \n", "\n", "\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", "model = BertModel.from_pretrained('bert-base-uncased',return_dict=False)\n", "\n", "inputs = tokenizer(\"Hello, my dog is cute\",return_tensors=\"pt\",max_length=128,padding='max_length',truncation=True)\n" ] }, { "cell_type": "markdown", "id": "universal-desperate", "metadata": {}, "source": [ "To enable pipeline mode during compilation, you only need to add the compiler flag `--neuroncore-pipeline-cores` and set the number of desired cores. 
The cell below sets up a `neuroncore_pipeline_cores` variable, which you can set to the available number of NeuronCores on the instance: _inf1.6xlarge_ has 16 NeuronCores in 4 Inferentia chips. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "passing-masters", "metadata": {}, "outputs": [], "source": [ "# Number of Cores in the Pipeline Mode\n", "neuroncore_pipeline_cores = 16 # This value should be 4 on an inf1.xlarge\n", "\n", "# Compiling for neuroncore-pipeline-cores='16'\n", "neuron_pipeline_model = torch.neuron.trace(model,\n", " example_inputs = (inputs['input_ids'],inputs['attention_mask']),\n", " verbose=1,\n", " compiler_args = ['--neuroncore-pipeline-cores', str(neuroncore_pipeline_cores)]\n", " )" ] }, { "cell_type": "markdown", "id": "enhanced-swedish", "metadata": {}, "source": [ "## Running the BERT base model on 16 NeuronCores\n", "Next, time one execution and check for the latency on the single inference call over 16 cores. You will load the model into Inferentia with a single inference call. A large \"wall time\" is expected when you first run the next cell; running the cell twice will show the actual inference latency:" ] }, { "cell_type": "code", "execution_count": null, "id": "expressed-trinity", "metadata": {}, "outputs": [], "source": [ "%%time\n", "# The following line tests inference and should be executed on Inf1 instance family. \n", "outputs = neuron_pipeline_model(*(inputs['input_ids'],inputs['attention_mask']))" ] }, { "cell_type": "markdown", "id": "located-graphic", "metadata": {}, "source": [ "Check also for the throughput of the single model running over 16 NeuronCores. \n", "\n", "The sequential inference test (for loop) does not measure all the performance one can achieve with Pipeline mode. As the inference runs in streaming fashion, at least 15 cores are waiting for a new call until the last one processes the first call. This results in low NeuronCore utilization. To improve hardware utilization you will require parallel inference requests, which you'll test in the next section." ] }, { "cell_type": "code", "execution_count": null, "id": "hydraulic-calcium", "metadata": {}, "outputs": [], "source": [ "for _ in tqdm(range(100)):\n", " outputs = neuron_pipeline_model(*(inputs['input_ids'],inputs['attention_mask']))\n", " " ] }, { "cell_type": "markdown", "id": "patent-victoria", "metadata": {}, "source": [ "## Load Testing the Pipeline Parallel Mode\n", "\n", "To put the 16-NeuronCore group to the test, a client has to run concurrent requests to the model. In this Notebook setup you achieve it by creating a thread pool with `Joblib.Parallel`, with all workers on the pool running one inference call. \n", "\n", "You can define a new method called `inference_latency()` so that you measure the amount of time each inference call takes." ] }, { "cell_type": "code", "execution_count": null, "id": "appointed-adventure", "metadata": {}, "outputs": [], "source": [ "def inference_latency(model,*inputs):\n", " \"\"\"\n", " inference_latency is a simple method to return the latency of a model inference.\n", " \n", " Parameters:\n", " model: torch model object loaded using torch.jit.load\n", " inputs: model() args\n", " \n", " Returns:\n", " latency in seconds\n", " \"\"\"\n", " start = time.time()\n", " _ = model(*inputs)\n", " return time.time() - start" ] }, { "cell_type": "markdown", "id": "environmental-guinea", "metadata": {}, "source": [ "Use `tqdm` to measure total throughput of your experiment, with a nice side-effect of a \"cool progress bar!\". 
The total throughput is expected to be high, so set your experiment range to a large number, here 30k inferences. \n", "\n", "To calculate latency statistics over the returned list of 30k latencies, use the `numpy.quantile()` method." ] }, { "cell_type": "code", "execution_count": null, "id": "played-catch", "metadata": {}, "outputs": [], "source": [ "t = tqdm(range(30000), position=0, leave=True)\n", "latency = Parallel(n_jobs=12,prefer=\"threads\")(delayed(inference_latency)(neuron_pipeline_model,*(inputs['input_ids'],inputs['attention_mask'])) for i in t)\n", "\n", "p50 = np.quantile(latency[-10000:],0.50) * 1000\n", "p95 = np.quantile(latency[-10000:],0.95) * 1000\n", "p99 = np.quantile(latency[-10000:],0.99) * 1000\n", "avg_throughput = t.total/t.format_dict['elapsed']\n", "print(f'Avg Throughput: {avg_throughput:.1f}')\n", "print(f'50th Percentile Latency:{p50:.1f} ms')\n", "print(f'95th Percentile Latency:{p95:.1f} ms')\n", "print(f'99th Percentile Latency:{p99:.1f} ms')" ] }, { "cell_type": "markdown", "id": "exposed-northern", "metadata": {}, "source": [ "Save the compiled model for later use:" ] }, { "cell_type": "code", "execution_count": null, "id": "imperial-complex", "metadata": {}, "outputs": [], "source": [ "# Save the TorchScript graph\n", "neuron_pipeline_model.save('bert-base-uncased-neuron-pipeline.pt')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "abroad-earthquake", "metadata": {}, "source": [ "## Bonus Section - Load Testing Data Parallel Mode" ] }, { "cell_type": "code", "execution_count": null, "id": "therapeutic-detector", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "from transformers import BertTokenizer \n", "\n", "from joblib import Parallel, delayed \n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "import os\n", "import time \n", "\n", "def inference_latency(model,*inputs):\n", " \"\"\"\n", " inference_latency is a simple method to return the latency of a model inference.\n", " \n", " Parameters:\n", " model: torch model object loaded using torch.jit.load\n", " inputs: model() args\n", " \n", " Returns:\n", " latency in seconds\n", " \"\"\"\n", " start = time.time()\n", " _ = model(*inputs)\n", " return time.time() - start\n", "\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", "\n", "inputs = tokenizer(\"Hello, my dog is cute\",return_tensors=\"pt\",max_length=128,padding='max_length',truncation=True)\n" ] }, { "cell_type": "markdown", "id": "legal-terrorist", "metadata": {}, "source": [ "You use the `'NEURON_RT_NUM_CORES'` environment variable to define how many NeuronCores will be used. Set the environment variable to the number of individual workers you want to test in parallel.\n", "\n", "`torch_neuron` will load one model per NeuronCore group until it runs out of cores. At that point, if the Python process continues to spawn more model objects using `torch.jit.load`, `torch_neuron` will start stacking more than one model per core, until the Inferentia chip memory is full. \n", "\n", "Inferentia is able to run inference over all the loaded models, but only one at a time. The Neuron Runtime takes care of dynamically switching the model context as requests come in; no extra worker process management is required. Use 1 model per NeuronCore to achieve maximum performance.\n", "\n", "The following cell creates a list with as many models as NeuronCore Groups and executes a single dummy inference to load the models into Inferentia. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "current-mechanics", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "# Number of data parallel workers\n", "number_of_workers=16 # This number should be 4 on an inf1.xlarge\n", "\n", "# Setting up a data parallel group\n", "os.environ['NEURON_RT_NUM_CORES'] = str(number_of_workers)\n", "\n", "# Loading 'number_of_workers' amount of models in Python memory\n", "model_list = [torch.jit.load('bert-base-uncased-neuron.pt') for _ in range(number_of_workers)]\n", "\n", "# Dummy inference to load models to Inferentia\n", "_ = [mod(*(inputs['input_ids'],inputs['attention_mask'])) for mod in model_list]\n" ] }, { "cell_type": "markdown", "id": "threatened-swaziland", "metadata": {}, "source": [ "Adapt the call to `joblib.Parallel()` iterating over a concatenated version of the `model_list`, to run 'round-robin' calls to each of the model workers. " ] }, { "cell_type": "code", "execution_count": null, "id": "fleet-month", "metadata": {}, "outputs": [], "source": [ "t = tqdm(model_list*1500,position=0, leave=True)\n", "latency = Parallel(n_jobs=number_of_workers,prefer=\"threads\")(delayed(inference_latency)(mod,*(inputs['input_ids'],inputs['attention_mask'])) for mod in t)\n", "\n", "p50 = np.quantile(latency[-10000:],0.50) * 1000\n", "p95 = np.quantile(latency[-10000:],0.95) * 1000\n", "p99 = np.quantile(latency[-10000:],0.99) * 1000\n", "avg_throughput = t.total/t.format_dict['elapsed']\n", "print(f'Avg Throughput: :{avg_throughput:.1f}')\n", "print(f'50th Percentile Latency:{p50:.1f} ms')\n", "print(f'95th Percentile Latency:{p95:.1f} ms')\n", "print(f'99th Percentile Latency:{p99:.1f} ms')" ] }, { "cell_type": "markdown", "id": "aggressive-stevens", "metadata": {}, "source": [ "For this model, despite the larger number of workers, the per-worker latency increases when running a single model per core, which in turn reduces the total throughput. \n", "\n", "This behavior may not repeat if the model memory footprint or the input payload size changes, i.e batch size > 1. We encourage you to experiment with the data parallel and pipeline parallel modes to optimize your application performance. " ] } ], "metadata": { "kernelspec": { "display_name": "Environment (conda_aws_neuron_pytorch_p36)", "language": "python", "name": "conda_aws_neuron_pytorch_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/pytorch/resnet50.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ResNet50 model for Inferentia\n", "\n", "\n", "## Introduction:\n", "\n", "In this tutorial we will compile and deploy a ResNet50 model for inference on Inferentia. \n", "\n", "This Jupyter notebook should run on an inf1.6xlarge instance. The inference part of this tutorial requires an inf1 instance, not the compilation stage. For simplicity we will run this tutorial on an inf1.6xlarge, but in real life scenarios the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs. \n", "\n", "In this tutorial we provide three main sections:\n", "\n", "1. Compile the ResNet50 model and infer with a batch size of 1\n", "\n", "2. 
Run the same compiled model on multiple NeuronCores using `torch.neuron.DataParallel` and dynamic batching\n", "\n", "3. Compile the ResNet50 model with a batch size of 5 and run it on multiple NeuronCores using `torch.neuron.DataParallel` for optimal performance on Inferentia\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch>=1.8`\n", "- `torch-neuron`\n", "- `torchvision`\n", "- `neuron-cc[tensorflow]`\n", "\n", "These will be installed by default when configuring your environment using the Neuron PyTorch setup guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile model for Neuron\n", "\n", "The following step will compile the ResNet50 model for Inferentia. This will take a few minutes. At the end of script execution, the compiled model is saved as `resnet50_neuron.pt` in your local directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torchvision import models, transforms, datasets\n", "import torch_neuron\n", "\n", "# Create an example input for compilation\n", "image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)\n", "\n", "# Load a pretrained ResNet50 model\n", "model = models.resnet50(pretrained=True)\n", "\n", "# Tell the model we are using it for evaluation (not training)\n", "model.eval()\n", "\n", "# Analyze the model - this will show operator support and operator count\n", "torch.neuron.analyze_model(model, example_inputs=[image])\n", "\n", "# Compile the model using torch.neuron.trace to create a Neuron model\n", "# that is optimized for the Inferentia hardware\n", "model_neuron = torch.neuron.trace(model, example_inputs=[image])\n", "\n", "# The output of the compilation step will report the percentage of operators that \n", "# are compiled to Neuron, for example:\n", "#\n", "# INFO:Neuron:The neuron partitioner created 1 sub-graphs\n", "# INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%\n", "# \n", "# We will also be warned if there are operators that are not placed on the Inferentia hardware\n", "\n", "# Save the compiled model\n", "model_neuron.save(\"resnet50_neuron.pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference on Inferentia\n", "\n", "We can use the compiled Neuron model to run inference on Inferentia.\n", "\n", "In the following example, we preprocess a sample image for inference using the CPU model and Neuron model. We compare the predicted labels from the CPU model and Neuron model to verify that they are the same.\n", "\n", "Important: Do not perform inference with a Neuron traced model on a non-Neuron supported instance, as the results will not be calculated properly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a preprocessing function\n", "\n", "We define a basic image preprocessing function that loads a sample image and labels, normalizes and batches the image, and transforms the image into a tensor for inference using the compiled Neuron model."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "from urllib import request\n", "\n", "# Create an image directory containing a sample image of a small kitten\n", "os.makedirs(\"./torch_neuron_test/images\", exist_ok=True)\n", "request.urlretrieve(\"https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg\",\n", " \"./torch_neuron_test/images/kitten_small.jpg\")\n", "\n", "# Fetch labels to output the top classifications\n", "request.urlretrieve(\"https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json\",\"imagenet_class_index.json\")\n", "idx2label = []\n", "\n", "# Read the labels and create a list to hold them for classification \n", "with open(\"imagenet_class_index.json\", \"r\") as read_file:\n", " class_idx = json.load(read_file)\n", " idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def preprocess(batch_size=1, num_neuron_cores=1):\n", " # Define a normalization function using the ImageNet mean and standard deviation\n", " normalize = transforms.Normalize(\n", " mean=[0.485, 0.456, 0.406],\n", " std=[0.229, 0.224, 0.225])\n", "\n", " # Resize the sample image to [1, 3, 224, 224], normalize it, and turn it into a tensor\n", " eval_dataset = datasets.ImageFolder(\n", " os.path.dirname(\"./torch_neuron_test/\"),\n", " transforms.Compose([\n", " transforms.Resize([224, 224]),\n", " transforms.ToTensor(),\n", " normalize,\n", " ])\n", " )\n", " image, _ = eval_dataset[0]\n", " image = torch.tensor(image.numpy()[np.newaxis, ...])\n", "\n", " # Create a \"batched\" image with enough images to go on each of the available NeuronCores\n", " # batch_size is the per-core batch size\n", " # num_neuron_cores is the number of NeuronCores being used\n", " batch_image = image\n", " for i in range(batch_size * num_neuron_cores - 1):\n", " batch_image = torch.cat([batch_image, image], 0)\n", " \n", " return batch_image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run inference using the Neuron model\n", "\n", "We import the necessary python modules, load the torch-neuron compiled model, and run inference on Inferentia. \n", "\n", "By default, the Neuron model will run on a single NeuronCore. In the next section, we will see how to run the Neuron model on multiple NeuronCores to fully saturate our hardware for optimal performance on Inferentia. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torchvision import models, transforms, datasets\n", "import torch_neuron\n", "\n", "# Get a sample image\n", "image = preprocess()\n", "\n", "# Run inference using the CPU model\n", "output_cpu = model(image)\n", "\n", "# Load the compiled Neuron model\n", "model_neuron = torch.jit.load('resnet50_neuron.pt')\n", "\n", "# Run inference using the Neuron model\n", "output_neuron = model_neuron(image)\n", "\n", "# Verify that the CPU and Neuron predictions are the same by comparing\n", "# the top-5 results\n", "top5_cpu = output_cpu[0].sort()[1][-5:]\n", "top5_neuron = output_neuron[0].sort()[1][-5:]\n", "\n", "# Lookup and print the top-5 labels\n", "top5_labels_cpu = [idx2label[idx] for idx in top5_cpu]\n", "top5_labels_neuron = [idx2label[idx] for idx in top5_neuron]\n", "print(\"CPU top-5 labels: {}\".format(top5_labels_cpu))\n", "print(\"Neuron top-5 labels: {}\".format(top5_labels_neuron))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Inference using torch.neuron.DataParallel\n", "\n", "To fully leverage the Inferentia hardware we want to use all avaialable NeuronCores. An inf1.xlarge and inf1.2xlarge have four NeuronCores, an inf1.6xlarge has 16 NeuronCores, and an inf1.24xlarge has 64 NeuronCores. For maximum performance on Inferentia hardware, we can use `torch.neuron.DataParallel` to utilize all available NeuronCores.\n", "\n", "`torch.neuron.DataParallel` implements data parallelism at the module level by duplicating the Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference.\n", "\n", "In the following section, we will run inference using the `torch.neuron.DataParallel` module to fully saturate the Inferentia hardware. We benchmark the model to collect throughput and latency statistics.\n", "\n", "Note: `torch.neuron.DataParallel` is new with Neuron 1.16.0. Please ensure you are using the latest Neuron package to run the following sections. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a benchmarking function\n", "\n", "We create a function that handles benchmarking the Neuron model to collect throughput and latency metrics. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import time\n", "\n", "def benchmark(model, image):\n", " print('Input image shape is {}'.format(list(image.shape)))\n", " \n", " # The first inference loads the model so exclude it from timing \n", " results = model(image)\n", " \n", " # Collect throughput and latency metrics\n", " latency = []\n", " throughput = []\n", "\n", " # Run inference for 100 iterations and calculate metrics\n", " num_infers = 100\n", " for _ in range(num_infers):\n", " delta_start = time()\n", " results = model(image)\n", " delta = time() - delta_start\n", " latency.append(delta)\n", " throughput.append(image.size(0)/delta)\n", " \n", " # Calculate and print the model throughput and latency\n", " print(\"Avg. 
Throughput: {:.0f}, Max Throughput: {:.0f}\".format(np.mean(throughput), np.max(throughput)))\n", " print(\"Latency P50: {:.0f} ms\".format(np.percentile(latency, 50)*1000.0))\n", " print(\"Latency P90: {:.0f} ms\".format(np.percentile(latency, 90)*1000.0))\n", " print(\"Latency P95: {:.0f} ms\".format(np.percentile(latency, 95)*1000.0))\n", " print(\"Latency P99: {:.0f} ms\\n\".format(np.percentile(latency, 99)*1000.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run Inference using torch.neuron.DataParallel\n", "\n", "We create the `torch.neuron.DataParallel` module using the compiled Neuron model, get a sample image, and benchmark the parallelized model on Neuron." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a torch.neuron.DataParallel module using the compiled Neuron model\n", "# By default, torch.neuron.DataParallel will use four cores on an inf1.xlarge\n", "# or inf1.2xlarge, 16 cores on an inf1.6xlarge, and 64 cores on an inf1.24xlarge\n", "model_neuron_parallel = torch.neuron.DataParallel(model_neuron)\n", "\n", "# Get sample image with batch size=1 per NeuronCore\n", "batch_size = 1\n", "\n", "# For an inf1.xlarge or inf1.2xlarge, set num_neuron_cores = 4\n", "num_neuron_cores = 16\n", "\n", "image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", "\n", "# Benchmark the model\n", "benchmark(model_neuron_parallel, image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference with dynamic batch sizes\n", "\n", "Batch size has a direct impact on model performance. The Inferentia chip is optimized to run with small batch sizes. This means that a Neuron compiled model can outperform a GPU model, even if running single digit batch sizes.\n", "\n", "As a general best practice, we recommend optimizing your model's throughput by compiling the model with a small batch size and gradually increasing it to find the peak throughput on Inferentia.\n", "\n", "Dynamic batching is a feature that allows you to use tensor batch sizes that the Neuron model was not originally compiled against. This is necessary because the underlying Inferentia hardware will always execute inferences with the batch size used during compilation. Fixed batch size execution allows tuning the input batch size for optimal performance. For example, batch size 1 may be best suited for an ultra-low latency on-demand inference application, while batch size > 1 can be used to maximize throughput for offline inferencing. Dynamic batching is implemented by slicing large input tensors into chunks that match the batch size used during the `torch.neuron.trace` compilation call. \n", "\n", "The `torch.neuron.DataParallel` class automatically enables dynamic batching on eligible models. This allows us to run inference in applications that have inputs with a variable batch size without needing to recompile the model.\n", "\n", "In the following example, we use the same `torch.neuron.DataParallel` module to run inference using several different batch sizes. Notice that latency increases consistently as the batch size increases. Throughput increases as well, up until a certain point where the input size becomes too large to be efficient."
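, "\n", "\n", "As a rough sketch of what this slicing means (hypothetical shapes, assuming a model compiled with batch size 1):\n", "\n", "```python\n", "# a [7, 3, 224, 224] input against a batch-1 compiled model is split into\n", "# 7 chunks of shape [1, 3, 224, 224], which are distributed across the NeuronCores\n", "chunks = torch.split(image, 1, dim=0)\n", "```"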
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using the same DataParallel model_neuron_parallel model, we can run\n", "# inference on inputs with a variable batch size without recompiling\n", "batch_sizes = [2, 3, 4, 5, 6, 7]\n", "for batch_size in batch_sizes:\n", " print('Batch size: {}'.format(batch_size))\n", " image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", " \n", " # Benchmark the model for each input batch size\n", " benchmark(model_neuron_parallel, image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile and Infer with different batch sizes on multiple NeuronCores\n", "\n", "Dynamic batching using small batch sizes can result in sub-optimal throughput because it involves slicing tensors into chunks and iteratively sending data to the hardware. Using a larger batch size at compilation time can use the Inferentia hardware more efficiently in order to maximize throughput. You can test the tradeoff between individual request latency and total throughput by fine-tuning the input batch size.\n", "\n", "In the following example, we recompile our model using a batch size of 5 and run the model using `torch.neuron.DataParallel` to fully saturate our Inferentia hardware for optimal performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create an input with batch size 5 for compilation\n", "batch_size = 5\n", "image = torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)\n", "\n", "# Recompile the ResNet50 model for inference with batch size 5\n", "model_neuron = torch.neuron.trace(model, example_inputs=[image])\n", "\n", "# Export to saved model\n", "model_neuron.save(\"resnet50_neuron_b{}.pt\".format(batch_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run inference with batch size of 5 using the Neuron model compiled for a batch size of 5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "batch_size = 5\n", "\n", "# Load compiled Neuron model\n", "model_neuron = torch.jit.load(\"resnet50_neuron_b{}.pt\".format(batch_size))\n", "\n", "# Create DataParallel model\n", "model_neuron_parallel = torch.neuron.DataParallel(model_neuron)\n", "\n", "# Get sample image with batch size=5\n", "image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", "\n", "# Benchmark the model\n", "benchmark(model_neuron_parallel, image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can experiment with different batch size values to see what gives the best overall throughput on Inferentia." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/pytorch/resnet50_partition.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manual Partitioning Tutorial\n", "\n", "In this tutorial we will run through how to manually partition a graph. 
There are six steps:\n", "\n", "1. Import ResNet50 code from torchvision and set to evaluation mode\n", "1. Download a test image and preprocess it\n", "1. Run inference on CPU as a baseline\n", "1. Manually partition the graph using Neuron\n", "1. Save the model to be loaded on another instance\n", "1. Inspect the graph to deepen our understanding\n", "\n", "The following is a ResNet50 implementation copied from `torchvision.models.resnet`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 1:** Import torchvision ResNet50 and run the model on CPU\n", "\n", "Note that training code can be inserted before `model.eval()` if retraining/fine-tuning is necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "\n", "\n", "def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):\n", " \"\"\"3x3 convolution with padding\"\"\"\n", " return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,\n", " padding=dilation, groups=groups, bias=False, dilation=dilation)\n", "\n", "\n", "def conv1x1(in_planes, out_planes, stride=1):\n", " \"\"\"1x1 convolution\"\"\"\n", " return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)\n", "\n", "\n", "class Bottleneck(nn.Module):\n", " expansion = 4\n", "\n", " def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,\n", " base_width=64, dilation=1, norm_layer=None):\n", " super(Bottleneck, self).__init__()\n", " if norm_layer is None:\n", " norm_layer = nn.BatchNorm2d\n", " width = int(planes * (base_width / 64.)) * groups\n", " # Both self.conv2 and self.downsample layers downsample the input when stride != 1\n", " self.conv1 = conv1x1(inplanes, width)\n", " self.bn1 = norm_layer(width)\n", " self.conv2 = conv3x3(width, width, stride, groups, dilation)\n", " self.bn2 = norm_layer(width)\n", " self.conv3 = conv1x1(width, planes * self.expansion)\n", " self.bn3 = norm_layer(planes * self.expansion)\n", " self.relu = nn.ReLU(inplace=True)\n", " self.downsample = downsample\n", " self.stride = stride\n", "\n", " def forward(self, x):\n", " identity = x\n", "\n", " out = self.conv1(x)\n", " out = self.bn1(out)\n", " out = self.relu(out)\n", "\n", " out = self.conv2(out)\n", " out = self.bn2(out)\n", " out = self.relu(out)\n", "\n", " out = self.conv3(out)\n", " out = self.bn3(out)\n", "\n", " if self.downsample is not None:\n", " identity = self.downsample(x)\n", "\n", " out += identity\n", " out = self.relu(out)\n", "\n", " return out\n", "\n", "\n", "class ResNet(nn.Module):\n", "\n", " def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,\n", " groups=1, width_per_group=64, replace_stride_with_dilation=None,\n", " norm_layer=None):\n", " super(ResNet, self).__init__()\n", " if norm_layer is None:\n", " norm_layer = nn.BatchNorm2d\n", " self._norm_layer = norm_layer\n", "\n", " self.inplanes = 64\n", " self.dilation = 1\n", " if replace_stride_with_dilation is None:\n", " # each element in the tuple indicates if we should replace\n", " # the 2x2 stride with a dilated convolution instead\n", " replace_stride_with_dilation = [False, False, False]\n", " if len(replace_stride_with_dilation) != 3:\n", " raise ValueError(\"replace_stride_with_dilation should be None \"\n", " \"or a 3-element tuple, got {}\".format(replace_stride_with_dilation))\n", " self.groups = groups\n", " self.base_width = width_per_group\n", " self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, 
stride=2, padding=3,\n", " bias=False)\n", " self.bn1 = norm_layer(self.inplanes)\n", " self.relu = nn.ReLU(inplace=True)\n", " self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)\n", " self.layer1 = self._make_layer(block, 64, layers[0])\n", " self.layer2 = self._make_layer(block, 128, layers[1], stride=2,\n", " dilate=replace_stride_with_dilation[0])\n", " self.layer3 = self._make_layer(block, 256, layers[2], stride=2,\n", " dilate=replace_stride_with_dilation[1])\n", " self.layer4 = self._make_layer(block, 512, layers[3], stride=2,\n", " dilate=replace_stride_with_dilation[2])\n", " self.avgpool = nn.AdaptiveAvgPool2d((1, 1))\n", " self.fc = nn.Linear(512 * block.expansion, num_classes)\n", "\n", " for m in self.modules():\n", " if isinstance(m, nn.Conv2d):\n", " nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')\n", " elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):\n", " nn.init.constant_(m.weight, 1)\n", " nn.init.constant_(m.bias, 0)\n", "\n", " # Zero-initialize the last BN in each residual branch,\n", " # so that the residual branch starts with zeros, and each residual block behaves like an identity.\n", " # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677\n", " if zero_init_residual:\n", " for m in self.modules():\n", " if isinstance(m, Bottleneck):\n", " nn.init.constant_(m.bn3.weight, 0)\n", " elif isinstance(m, BasicBlock):\n", " nn.init.constant_(m.bn2.weight, 0)\n", "\n", " def _make_layer(self, block, planes, blocks, stride=1, dilate=False):\n", " norm_layer = self._norm_layer\n", " downsample = None\n", " previous_dilation = self.dilation\n", " if dilate:\n", " self.dilation *= stride\n", " stride = 1\n", " if stride != 1 or self.inplanes != planes * block.expansion:\n", " downsample = nn.Sequential(\n", " conv1x1(self.inplanes, planes * block.expansion, stride),\n", " norm_layer(planes * block.expansion),\n", " )\n", "\n", " layers = []\n", " layers.append(block(self.inplanes, planes, stride, downsample, self.groups,\n", " self.base_width, previous_dilation, norm_layer))\n", " self.inplanes = planes * block.expansion\n", " for _ in range(1, blocks):\n", " layers.append(block(self.inplanes, planes, groups=self.groups,\n", " base_width=self.base_width, dilation=self.dilation,\n", " norm_layer=norm_layer))\n", "\n", " return nn.Sequential(*layers)\n", "\n", " def forward(self, x):\n", " x = self.conv1(x)\n", " x = self.bn1(x)\n", " x = self.relu(x)\n", " x = self.maxpool(x)\n", "\n", " x = self.layer1(x)\n", " x = self.layer2(x)\n", " x = self.layer3(x)\n", " x = self.layer4(x)\n", "\n", " x = self.avgpool(x)\n", " x = torch.flatten(x, 1)\n", " x = self.fc(x)\n", "\n", " return x\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.utils.model_zoo import load_url as load_state_dict_from_url\n", "\n", "model = ResNet(Bottleneck, [3, 4, 6, 3])\n", "state_dict = load_state_dict_from_url('https://download.pytorch.org/models/resnet50-19c8e357.pth', progress=True)\n", "model.load_state_dict(state_dict)\n", "# you can do some training here, before calling model.eval()\n", "model.eval()\n", "print('ResNet50 model is turned into inference mode')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 2:** Download a cat image and preprocess it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "from torchvision import transforms, datasets\n", "from 
tensorflow.keras.applications import resnet50\n", "import urllib.request\n", "\n", "imagedir = './images'\n", "os.makedirs(imagedir, exist_ok=True)\n", "urllib.request.urlretrieve(\n", "    'https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg',\n", "    os.path.join(imagedir, 'kitten_small.jpg'),\n", ")\n", "normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n", "                                 std=[0.229, 0.224, 0.225])\n", "eval_dataset = datasets.ImageFolder(\n", "    '.',\n", "    transforms.Compose([\n", "        transforms.Resize([224, 224]),\n", "        transforms.ToTensor(),\n", "        normalize,\n", "    ])\n", ")\n", "image, label = eval_dataset[0]\n", "image = torch.tensor(image.numpy()[np.newaxis, ...])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 3:** Run inference without Neuron for comparison" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('model inference result:')\n", "print(resnet50.decode_predictions(model(image).detach().numpy(), top=5)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 3 (continued):** Run the same inference using `torch.jit.trace` so that we can save and load the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jit_trace = torch.jit.trace(model, example_inputs=image)\n", "print('jit.trace inference result:')\n", "print(resnet50.decode_predictions(jit_trace(image).detach().numpy(), top=5)[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jit_trace_filename = 'resnet50_jit_trace.pt'\n", "jit_trace.save(jit_trace_filename)\n", "jit_trace_loaded = torch.jit.load(jit_trace_filename)\n", "print('loaded jit.trace inference result:')\n", "print(resnet50.decode_predictions(jit_trace_loaded(image).detach().numpy(), top=5)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 4:** Manually partition the ResNet50 model and execute it\n", "\n", "To generate a Neuron-optimized TorchScript with only layers 1-4 placed on the Neuron runtime, we first define a new module class `ResNetNeuron` inheriting from `ResNet`. We add `torch.neuron.trace` calls in the `trace` method of this module in order to turn the layer submodules into Neuron-optimized ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch.neuron\n", "\n", "class ResNetNeuron(ResNet):\n", "\n", "    def trace(self, x):\n", "        x = self.conv1(x)\n", "        x = self.bn1(x)\n", "        x = self.relu(x)\n", "        x = self.maxpool(x)\n", "\n", "        self.layer1 = torch.neuron.trace(self.layer1, x, fallback=False)\n", "        x = self.layer1(x)\n", "\n", "        self.layer2 = torch.neuron.trace(self.layer2, x, fallback=False)\n", "        x = self.layer2(x)\n", "\n", "        self.layer3 = torch.neuron.trace(self.layer3, x, fallback=False)\n", "        x = self.layer3(x)\n", "\n", "        self.layer4 = torch.neuron.trace(self.layer4, x, fallback=False)\n", "\n", "    def forward(self, x):\n", "        x = self.conv1(x)\n", "        x = self.bn1(x)\n", "        x = self.relu(x)\n", "        x = self.maxpool(x)\n", "\n", "        # After running ResNetNeuron::trace, these layers will be placed on Neuron\n", "        x = self.layer1(x)\n", "        x = self.layer2(x)\n", "        x = self.layer3(x)\n", "        x = self.layer4(x)\n", "\n", "        x = self.avgpool(x)\n", "        x = torch.flatten(x, 1)\n", "        x = self.fc(x)\n", "\n", "        return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now construct the class and run an inference to trigger the `neuron-cc` compiler. 
Watch for the [ \\* ] icon to the left of this cell to disappear and show a number - this will take a minute or two to run" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model_neuron = ResNetNeuron(Bottleneck, [3, 4, 6, 3])\n", "model_neuron.load_state_dict(state_dict)\n", "model_neuron.eval()\n", "model_neuron.trace(image) # this line triggers neuron-cc compiler\n", "result = model_neuron(image)\n", "print('Neuron optimized model inference result:')\n", "print(resnet50.decode_predictions(result.detach().numpy(), top=5)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 5:** Save the model as TorchScript ready to deploy\n", "\n", "To deploy the Neuron-optimized as TorchScript, we use `torch.jit.trace` again to generate TorchScript for the entire mode, including the Neuron-optimized `ScriptModule`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neuron_trace = torch.jit.trace(model_neuron, example_inputs=image)\n", "print('neuron.trace inference result:')\n", "print(resnet50.decode_predictions(neuron_trace(image).detach().numpy(), top=5)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Neuron-optimized `ScriptModule` can be saved/loaded easily and be deployed on inf1 instances." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neuron_trace_filename = 'resnet50_neuron_trace.pt'\n", "neuron_trace.save(neuron_trace_filename)\n", "neuron_trace_loaded = torch.jit.load(neuron_trace_filename)\n", "print('loaded neuron.trace inference result:')\n", "print(resnet50.decode_predictions(neuron_trace_loaded(image).detach().numpy(), top=5)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**STEP 6:** Understanding the neuron graph\n", "\n", "We can inspect the graph property of the Neuron-optimized `ScriptModule` to get an idea of how Neuron-optimization is performed. Each `torch.neuron.trace` call fuses a submodule (layer) into a `neuron::forward`/`NeuronModule` operator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neuron_trace.graph" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: src/examples/pytorch/torch-neuronx/bert-base-cased-finetuned-mrpc-inference-on-trn1-tutorial.ipynb ================================================ { "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "e11b2ce1", "metadata": {}, "source": [ "# Compiling and Deploying HuggingFace Pretrained BERT on Trn1 or Inf2" ] }, { "attachments": {}, "cell_type": "markdown", "id": "59a44364", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this tutorial we will compile and deploy a HuggingFace 🤗 Transformers BERT model for accelerated inference on Neuron. In this tutorial, we will be deploying directly on Trn1/Inf2 instances. If you are looking to deploy this model through SageMaker on Inf2 instance, please visit the [Sagemaker samples repository](https://github.com/aws-neuron/aws-neuron-sagemaker-samples/tree/master/inference/inf2-bert-on-sagemaker). 
\n", "\n", "This tutorial will use the [bert-base-cased-finetuned-mrpc](https://huggingface.co/bert-base-cased-finetuned-mrpc) model. This model has 12 layers, 768 hidden dimensions, 12 attention heads, and 110M total parameters. The final layer is a binary classification head that has been trained on the Microsoft Research Paraphrase Corpus (`mrpc`). The input to the model is two sentences and the output of the model is whether or not those sentences are a paraphrase of each other. \n", "\n", "This tutorial has the following main sections:\n", "\n", "1. Install dependencies\n", "1. Compile the BERT model\n", "1. Run inference on Neuron and compare results to CPU\n", "1. Benchmark the model using multicore inference\n", "1. Finding the optimal batch size\n", "\n", "This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger.) or Inf2 instance (`inf2.xlarge` or larger.)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9ceecb92", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "The code in this tutorial is written for Jupyter Notebooks. To use Jupyter Notebook on the Neuron instance, you\n", "can use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).\n", "\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuronx`\n", "- `neuronx-cc`\n", "- `transformers`\n", "\n", "Most of these packages will be installed when configuring your environment using the Trn1/Inf2 setup guide. The additional dependencies must be installed here:" ] }, { "cell_type": "code", "execution_count": null, "id": "66392b0b", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "%env HF_HUB_DISABLE_PROGRESS_BARS=1 # Avoids xet progress bar model download error\n", "!pip install --upgrade transformers" ] }, { "cell_type": "markdown", "id": "82533d8e", "metadata": {}, "source": [ "## Compile the model into an AWS Neuron optimized TorchScript\n", "\n", "In the following section, we load the BERT model and tokenizer, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as `TorchScript`.\n", "\n", "`torch_neuronx.trace()` expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenizer output using the `encode` function. \n", "\n", "The result of the trace stage will be a static executable where the operations to be run upon inference are determined during compilation. This means that when inferring, the resulting Neuron model must be executed with tensors that are the exact same shape as those provided at compilation time. If a model is given a tensor at inference time whose shape does not match the tensor given at compilation time, an error will occur.\n", "\n", "For language models, the shape of the tokenizer tensors can vary based on the length of input sentence. We can satisfy the Neuron restriction of using a fixed shape input by padding all varying input tensors to a specified length. In a deployment scenario, the padding size should be chosen based on the maximum token length that is expected to occur for the application.\n", "\n", "In the following section we will assume that we will receive a maximum of 128 tokens at inference time. 
We will pad our example inputs by using `padding='max_length'` and to avoid potential errors caused by creating a tensor that is larger than `max_length=128`, we will always tokenize using `truncation=True`." ] }, { "cell_type": "code", "execution_count": null, "id": "0c9aac5e", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuronx\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n", "import transformers\n", "\n", "\n", "def encode(tokenizer, *inputs, max_length=128, batch_size=1):\n", " tokens = tokenizer.encode_plus(\n", " *inputs,\n", " max_length=max_length,\n", " padding='max_length',\n", " truncation=True,\n", " return_tensors=\"pt\"\n", " )\n", " return (\n", " torch.repeat_interleave(tokens['input_ids'], batch_size, 0),\n", " torch.repeat_interleave(tokens['attention_mask'], batch_size, 0),\n", " torch.repeat_interleave(tokens['token_type_ids'], batch_size, 0),\n", " )\n", "\n", "\n", "# Create the tokenizer and model\n", "name = \"bert-base-cased-finetuned-mrpc\"\n", "tokenizer = AutoTokenizer.from_pretrained(name)\n", "model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)\n", "\n", "# Set up some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "paraphrase = encode(tokenizer, sequence_0, sequence_2)\n", "not_paraphrase = encode(tokenizer, sequence_0, sequence_1)\n", "\n", "# Run the original PyTorch BERT model on CPU\n", "cpu_paraphrase_logits = model(*paraphrase)[0]\n", "cpu_not_paraphrase_logits = model(*not_paraphrase)[0]\n", "\n", "# Compile the model for Neuron\n", "model_neuron = torch_neuronx.trace(model, paraphrase)\n", "\n", "# Save the TorchScript for inference deployment\n", "filename = 'model.pt'\n", "torch.jit.save(model_neuron, filename)" ] }, { "cell_type": "markdown", "id": "53e9605d", "metadata": {}, "source": [ "## Run inference and compare results\n", "\n", "In this section we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.\n", "\n", "NOTE: Although this tutorial section uses one NeuronCore (and the next section uses two NeuronCores), by default each Jupyter notebook Python process will attempt to take ownership of all NeuronCores visible on the instance. For multi-process applications where each process should only use a subset of the NeuronCores on the instance you can use NEURON_RT_NUM_CORES=N or NEURON_RT_VISIBLE_CORES=< list of NeuronCore IDs > when starting the Jupyter notebook as described in [NeuronCore Allocation and Model Placement for Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/core-placement.html)." 
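 ] }, { "cell_type": "markdown", "id": "neuron-core-pinning-note", "metadata": {}, "source": [ "As a minimal sketch of the NOTE above (the core IDs are illustrative, and the canonical approach is to set the variable when starting the Jupyter process), the restriction could also be applied from Python, provided it runs before the first model is loaded in the process:\n", "\n", "```python\n", "import os\n", "\n", "# Must run before the Neuron runtime initializes, i.e. before the first\n", "# torch.jit.load of a Neuron model in this process. Core IDs are illustrative.\n", "os.environ['NEURON_RT_VISIBLE_CORES'] = '0,1'\n", "```"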
] }, { "cell_type": "code", "execution_count": null, "id": "a8d509aa", "metadata": {}, "outputs": [], "source": [ "# Load the TorchScript compiled model\n", "model_neuron = torch.jit.load(filename)\n", "\n", "# Verify the TorchScript works on both example inputs\n", "neuron_paraphrase_logits = model_neuron(*paraphrase)[0]\n", "neuron_not_paraphrase_logits = model_neuron(*not_paraphrase)[0]\n", "\n", "# Compare the results\n", "print('CPU paraphrase logits: ', cpu_paraphrase_logits.detach().numpy())\n", "print('Neuron paraphrase logits: ', neuron_paraphrase_logits.detach().numpy())\n", "print('CPU not-paraphrase logits: ', cpu_not_paraphrase_logits.detach().numpy())\n", "print('Neuron not-paraphrase logits: ', neuron_not_paraphrase_logits.detach().numpy())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a4553cc9", "metadata": {}, "source": [ "## Benchmarking\n", "\n", "In this section we benchmark the performance of the BERT model on Neuron. By default, models compiled with `torch_neuronx` will always execute on a *single* NeuronCore. When loading *multiple* models, the default behavior of the Neuron runtime is to evenly distribute models across all available NeuronCores. The runtime places models on the NeuronCore that has the fewest models loaded to it first. In the following section, we will `torch.jit.load` multiple instances of the model which should each be loaded onto their own NeuronCore. It is not useful to load more copies of a model than the number of NeuronCores on the instance since an individual NeuronCore can only execute one model at a time.\n", "\n", "To ensure that we are maximizing hardware utilization, we must run inferences using multiple threads in parallel. It is nearly always recommended to use some form of threading/multiprocessing and some form of model replication since even the smallest Neuron EC2 instance has 2 NeuronCores available. Applications with no form of threading are only capable of `1 / num_neuron_cores` hardware utilization which becomes especially problematic on large instances.\n", "\n", "One way to view the hardware utilization is by executing the `neuron-top` application in the terminal while the benchmark is executing. If the monitor shows >90% utilization on all NeuronCores, this is a good indication that the hardware is being utilized effectively.\n", "\n", "In this example we load two models, which utilizes all NeuronCores (2) on a `trn1.2xlarge` or `inf2.xlarge` instance. Additional models can be loaded and run in parallel on larger Trn1 or Inf2 instance sizes to increase throughput.\n", "\n", "We define a benchmarking function that loads two optimized BERT models onto two separate NeuronCores, runs multithreaded inference, and calculates the corresponding latency and throughput." 
] }, { "cell_type": "code", "execution_count": null, "id": "c9e14b0d", "metadata": {}, "outputs": [], "source": [ "import time\n", "import concurrent.futures\n", "import numpy as np\n", "\n", "\n", "def benchmark(filename, example, n_models=2, n_threads=2, batches_per_thread=1000):\n", " \"\"\"\n", " Record performance statistics for a serialized model and its input example.\n", "\n", " Arguments:\n", " filename: The serialized torchscript model to load for benchmarking.\n", " example: An example model input.\n", " n_models: The number of models to load.\n", " n_threads: The number of simultaneous threads to execute inferences on.\n", " batches_per_thread: The number of example batches to run per thread.\n", "\n", " Returns:\n", " A dictionary of performance statistics.\n", " \"\"\"\n", "\n", " # Load models\n", " models = [torch.jit.load(filename) for _ in range(n_models)]\n", "\n", " # Warmup\n", " for _ in range(8):\n", " for model in models:\n", " model(*example)\n", "\n", " latencies = []\n", "\n", " # Thread task\n", " def task(model):\n", " for _ in range(batches_per_thread):\n", " start = time.time()\n", " model(*example)\n", " finish = time.time()\n", " latencies.append((finish - start) * 1000)\n", "\n", " # Submit tasks\n", " begin = time.time()\n", " with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:\n", " for i in range(n_threads):\n", " pool.submit(task, models[i % len(models)])\n", " end = time.time()\n", "\n", " # Compute metrics\n", " boundaries = [50, 95, 99]\n", " percentiles = {}\n", "\n", " for boundary in boundaries:\n", " name = f'latency_p{boundary}'\n", " percentiles[name] = np.percentile(latencies, boundary)\n", " duration = end - begin\n", " batch_size = 0\n", " for tensor in example:\n", " if batch_size == 0:\n", " batch_size = tensor.shape[0]\n", " inferences = len(latencies) * batch_size\n", " throughput = inferences / duration\n", "\n", " # Metrics\n", " metrics = {\n", " 'filename': str(filename),\n", " 'batch_size': batch_size,\n", " 'batches': len(latencies),\n", " 'inferences': inferences,\n", " 'threads': n_threads,\n", " 'models': n_models,\n", " 'duration': duration,\n", " 'throughput': throughput,\n", " **percentiles,\n", " }\n", "\n", " display(metrics)\n", "\n", "\n", "def display(metrics):\n", " \"\"\"\n", " Display the metrics produced by `benchmark` function.\n", "\n", " Args:\n", " metrics: A dictionary of performance statistics.\n", " \"\"\"\n", " pad = max(map(len, metrics)) + 1\n", " for key, value in metrics.items():\n", "\n", " parts = key.split('_')\n", " parts = list(map(str.title, parts))\n", " title = ' '.join(parts) + \":\"\n", "\n", " if isinstance(value, float):\n", " value = f'{value:0.3f}'\n", "\n", " print(f'{title :<{pad}} {value}')\n", "\n", "\n", "# Benchmark BERT on Neuron\n", "benchmark(filename, paraphrase)" ] }, { "cell_type": "markdown", "id": "fc374b12", "metadata": {}, "source": [ "## Finding the optimal batch size" ] }, { "cell_type": "markdown", "id": "113acb55", "metadata": {}, "source": [ "Batch size has a direct impact on model performance. The NeuronCore architecture is optimized to maximize throughput with relatively small batch sizes. This means that a Neuron compiled model can outperform a GPU model, even if running single digit batch sizes.\n", "\n", "As a general best practice, we recommend optimizing your model’s throughput by compiling the model with a small batch size and gradually increasing it to find the peak throughput on Neuron. 
To minimize latency, using `batch_size = 1` will nearly always be optimal. This batch size configuration is typically used for on-demand inference applications. To maximize throughput, *usually* `1 < batch_size < 10` is optimal. A configuration which uses a larger batch size is generally ideal for batched on-demand inference or offline batch processing.\n", "\n", "In the following section, we compile BERT for multiple batch size inputs. We then run inference on each batch size and benchmark the performance. Notice that latency increases consistently as the batch size increases. Throughput increases as well, up until a certain point where the input size becomes too large to be efficient." ] }, { "cell_type": "code", "execution_count": null, "id": "be26aafc", "metadata": {}, "outputs": [], "source": [ "# Compile BERT for different batch sizes\n", "for batch_size in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:\n", "    tokenizer = AutoTokenizer.from_pretrained(name)\n", "    model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)\n", "    example = encode(tokenizer, sequence_0, sequence_2, batch_size=batch_size)\n", "    model_neuron = torch_neuronx.trace(model, example)\n", "    filename = f'model_batch_size_{batch_size}.pt'\n", "    torch.jit.save(model_neuron, filename)" ] }, { "cell_type": "code", "execution_count": null, "id": "8f0f6ed2", "metadata": {}, "outputs": [], "source": [ "# Benchmark BERT for different batch sizes\n", "for batch_size in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:\n", "    print('-'*50)\n", "    example = encode(tokenizer, sequence_0, sequence_2, batch_size=batch_size)\n", "    filename = f'model_batch_size_{batch_size}.pt'\n", "    benchmark(filename, example)\n", "    print()" ] } ], "metadata": { "kernelspec": { "display_name": "Python (Neuron PyTorch)", "language": "python", "name": "pytorch_venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.16" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/pytorch/torch-neuronx/resnet50-inference-on-trn1-tutorial.ipynb ================================================ { "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "6a30ffd9", "metadata": {}, "source": [ "# Compiling and Deploying ResNet50 on Trn1 or Inf2" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ea682fbe", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this tutorial we will compile and deploy a TorchVision ResNet50 model for accelerated inference on Neuron. To get started with\n", "Jupyter Notebook on the Neuron instance you launched, please use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).\n", "\n", "This tutorial will use the [resnet50](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html) model, which is primarily used for arbitrary image classification tasks.\n", "\n", "This tutorial has the following main sections:\n", "\n", "1. Install dependencies\n", "1. Compile the ResNet model\n", "1. Run inference on Neuron and compare results to CPU\n", "1. Benchmark the model using multicore inference\n", "1. Find the optimal batch size\n", "\n", "This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger) 
or an Inf2 instance (`inf2.xlarge` or larger)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "5f60760a", "metadata": {}, "source": [ "## Install dependencies\n", "The code in this tutorial is written for Jupyter Notebooks. To use Jupyter Notebook on the Neuron instance, you\n", "can use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).\n", "\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuronx`\n", "- `neuronx-cc`\n", "- `torchvision`\n", "- `Pillow`\n", "\n", "Most of these packages will be installed when configuring your environment using the Trn1/Inf2 setup guide. The additional dependencies must be installed here:" ] }, { "cell_type": "code", "execution_count": null, "id": "c44c5df5", "metadata": {}, "outputs": [], "source": [ "# Avoid the xet progress bar model download error. The comment sits on its\n", "# own line because %env treats the rest of the line as the value.\n", "%env HF_HUB_DISABLE_PROGRESS_BARS=1\n", "!pip install Pillow" ] }, { "cell_type": "markdown", "id": "de2efba5", "metadata": {}, "source": [ "## Compile the model into an AWS Neuron optimized TorchScript\n", "\n", "In the following section, we load the model, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as `TorchScript`.\n", "\n", "`torch_neuronx.trace()` expects a tensor or tuple of tensor inputs to use for tracing, so we convert the input image into a tensor using the `get_image` function.\n", "\n", "The result of the trace stage will be a static executable where the operations to be run upon inference are determined during compilation. This means that when inferring, the resulting Neuron model must be executed with tensors that are the exact same shape as those provided at compilation time. If a model is given a tensor at inference time whose shape does not match the tensor given at compilation time, an error will occur. \n", "\n", "In the following section, we assume that we will receive an image shape of `[1, 3, 224, 224]` at inference time."
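 ] }, { "cell_type": "markdown", "id": "fixed-shape-note", "metadata": {}, "source": [ "As a small illustration of this restriction, a hypothetical guard (`check_shape` is our own helper for this sketch, not part of `torch_neuronx`) could fail fast before invoking the traced model:\n", "\n", "```python\n", "def check_shape(tensor, expected=(1, 3, 224, 224)):\n", "    # Traced Neuron models only accept the tensor shape used at compilation\n", "    # time, so fail fast with a clear message instead of a runtime error.\n", "    if tuple(tensor.shape) != expected:\n", "        raise ValueError(f'expected shape {expected}, got {tuple(tensor.shape)}')\n", "    return tensor\n", "```"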
] }, { "cell_type": "code", "execution_count": null, "id": "1650de1f", "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib\n", "from PIL import Image\n", "\n", "import torch\n", "import torch_neuronx\n", "from torchvision import models\n", "from torchvision.transforms import functional\n", "\n", "\n", "def get_image(batch_size=1, image_shape=(224, 224)):\n", " # Get an example input\n", " filename = \"000000039769.jpg\"\n", " if not os.path.exists(filename):\n", " url = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\n", " urllib.request.urlretrieve(url, filename)\n", " image = Image.open(filename).convert('RGB')\n", " image = functional.resize(image, (image_shape))\n", " image = functional.to_tensor(image)\n", " image = torch.unsqueeze(image, 0)\n", " image = torch.repeat_interleave(image, batch_size, 0)\n", " return (image, )\n", "\n", "\n", "# Create the model\n", "model = models.resnet50(pretrained=True)\n", "model.eval()\n", "\n", "# Get an example input\n", "image = get_image()\n", "\n", "# Run inference on CPU\n", "output_cpu = model(*image)\n", "\n", "# Compile the model\n", "model_neuron = torch_neuronx.trace(model, image)\n", "\n", "# Save the TorchScript for inference deployment\n", "filename = 'model.pt'\n", "torch.jit.save(model_neuron, filename)" ] }, { "cell_type": "markdown", "id": "25f453f8", "metadata": {}, "source": [ "## Run inference and compare results\n", "\n", "In this section we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs using the ImageNet classes." ] }, { "cell_type": "code", "execution_count": null, "id": "b4a203aa", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Load the TorchScript compiled model\n", "model_neuron = torch.jit.load(filename)\n", "\n", "# Run inference using the Neuron model\n", "output_neuron = model_neuron(*image)\n", "\n", "# Compare the results\n", "print(f\"CPU tensor: {output_cpu[0][0:10]}\")\n", "print(f\"Neuron tensor: {output_neuron[0][0:10]}\")\n", "\n", "# Download and read the ImageNet classes\n", "urllib.request.urlretrieve(\"https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json\",\"imagenet_class_index.json\")\n", "with open(\"imagenet_class_index.json\", \"r\") as file:\n", " class_id = json.load(file)\n", " id2label = [class_id[str(i)][1] for i in range(len(class_id))]\n", "\n", "# Lookup and print the top-5 labels\n", "top5_cpu = output_cpu[0].sort()[1][-5:]\n", "top5_neuron = output_neuron[0].sort()[1][-5:]\n", "top5_labels_cpu = [id2label[idx] for idx in top5_cpu]\n", "top5_labels_neuron = [id2label[idx] for idx in top5_neuron]\n", "print(f\"CPU top-5 labels: {top5_labels_cpu}\")\n", "print(f\"Neuron top-5 labels: {top5_labels_neuron}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c96389ae", "metadata": {}, "source": [ "## Benchmarking\n", "\n", "In this section we benchmark the performance of the ResNet model on Neuron. By default, models compiled with `torch_neuronx` will always execute on a *single* NeuronCore. When loading *multiple* models, the default behavior of the Neuron runtime is to evenly distribute models across all available NeuronCores. The runtime places models on the NeuronCore that has the fewest models loaded to it first. In the following section, we will `torch.jit.load` multiple instances of the model which should each be loaded onto their own NeuronCore. 
It is not useful to load more copies of a model than the number of NeuronCores on the instance since an individual NeuronCore can only execute one model at a time.\n", "\n", "To ensure that we are maximizing hardware utilization, we must run inferences using multiple threads in parallel. It is nearly always recommended to use some form of threading/multiprocessing and some form of model replication since even the smallest Neuron EC2 instance has 2 NeuronCores available. Applications with no form of threading are only capable of `1 / num_neuron_cores` hardware utilization which becomes especially problematic on large instances.\n", "\n", "One way to view the hardware utilization is by executing the `neuron-top` application in the terminal while the benchmark is executing. If the monitor shows >90% utilization on all NeuronCores, this is a good indication that the hardware is being utilized effectively.\n", "\n", "In this example we load two models, which utilizes all NeuronCores (2) on a `trn1.2xlarge` or `inf2.xlarge` instance. Additional models can be loaded and run in parallel on larger Trn1 or Inf2 instance sizes to increase throughput.\n", "\n", "We define a benchmarking function that loads two optimized ResNet models onto two separate NeuronCores, runs multithreaded inference, and calculates the corresponding latency and throughput." ] }, { "cell_type": "code", "execution_count": null, "id": "9657ae4f", "metadata": {}, "outputs": [], "source": [ "import time\n", "import concurrent.futures\n", "import numpy as np\n", "\n", "\n", "def benchmark(filename, example, n_models=2, n_threads=2, batches_per_thread=1000):\n", " \"\"\"\n", " Record performance statistics for a serialized model and its input example.\n", "\n", " Arguments:\n", " filename: The serialized torchscript model to load for benchmarking.\n", " example: An example model input.\n", " n_models: The number of models to load.\n", " n_threads: The number of simultaneous threads to execute inferences on.\n", " batches_per_thread: The number of example batches to run per thread.\n", "\n", " Returns:\n", " A dictionary of performance statistics.\n", " \"\"\"\n", "\n", " # Load models\n", " models = [torch.jit.load(filename) for _ in range(n_models)]\n", "\n", " # Warmup\n", " for _ in range(8):\n", " for model in models:\n", " model(*example)\n", "\n", " latencies = []\n", "\n", " # Thread task\n", " def task(model):\n", " for _ in range(batches_per_thread):\n", " start = time.time()\n", " model(*example)\n", " finish = time.time()\n", " latencies.append((finish - start) * 1000)\n", "\n", " # Submit tasks\n", " begin = time.time()\n", " with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:\n", " for i in range(n_threads):\n", " pool.submit(task, models[i % len(models)])\n", " end = time.time()\n", "\n", " # Compute metrics\n", " boundaries = [50, 95, 99]\n", " percentiles = {}\n", "\n", " for boundary in boundaries:\n", " name = f'latency_p{boundary}'\n", " percentiles[name] = np.percentile(latencies, boundary)\n", " duration = end - begin\n", " batch_size = 0\n", " for tensor in example:\n", " if batch_size == 0:\n", " batch_size = tensor.shape[0]\n", " inferences = len(latencies) * batch_size\n", " throughput = inferences / duration\n", "\n", " # Metrics\n", " metrics = {\n", " 'filename': str(filename),\n", " 'batch_size': batch_size,\n", " 'batches': len(latencies),\n", " 'inferences': inferences,\n", " 'threads': n_threads,\n", " 'models': n_models,\n", " 'duration': duration,\n", " 'throughput': 
throughput,\n", " **percentiles,\n", " }\n", "\n", " display(metrics)\n", "\n", "\n", "def display(metrics):\n", " \"\"\"\n", " Display the metrics produced by `benchmark` function.\n", "\n", " Args:\n", " metrics: A dictionary of performance statistics.\n", " \"\"\"\n", " pad = max(map(len, metrics)) + 1\n", " for key, value in metrics.items():\n", "\n", " parts = key.split('_')\n", " parts = list(map(str.title, parts))\n", " title = ' '.join(parts) + \":\"\n", "\n", " if isinstance(value, float):\n", " value = f'{value:0.3f}'\n", "\n", " print(f'{title :<{pad}} {value}')\n", "\n", "\n", "# Benchmark ResNet on Neuron\n", "benchmark(filename, image)" ] }, { "cell_type": "markdown", "id": "795d2fca", "metadata": {}, "source": [ "## Finding the optimal batch size\n", "\n", "Batch size has a direct impact on model performance. The NeuronCore architecture is optimized to maximize throughput with relatively small batch sizes. This means that a Neuron compiled model can outperform a GPU model, even if running single digit batch sizes.\n", "\n", "As a general best practice, we recommend optimizing your model’s throughput by compiling the model with a small batch size and gradually increasing it to find the peak throughput on Neuron. To minimize latency, using `batch size = 1` will nearly always be optimal. This batch size configuration is typically used for on-demand inference applications. To maximize throughput, *usually* `1 < batch_size < 10` is optimal. A configuration which uses a larger batch size is generally ideal for batched on-demand inference or offline batch processing.\n", "\n", "In the following section, we compile ResNet for multiple batch size inputs. We then run inference on each batch size and benchmark the performance. Notice that latency increases consistently as the batch size increases. Throughput increases as well, up until a certain point where the input size becomes too large to be efficient." 
] }, { "cell_type": "code", "execution_count": null, "id": "fdef1805", "metadata": {}, "outputs": [], "source": [ "# Compile ResNet for different batch sizes\n", "for batch_size in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:\n", " model = models.resnet50(pretrained=True)\n", " model.eval()\n", " example = get_image(batch_size=batch_size)\n", " model_neuron = torch_neuronx.trace(model, example)\n", " filename = f'model_batch_size_{batch_size}.pt'\n", " torch.jit.save(model_neuron, filename)" ] }, { "cell_type": "code", "execution_count": null, "id": "ec244d4e", "metadata": {}, "outputs": [], "source": [ "# Benchmark ResNet for different batch sizes\n", "for batch_size in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:\n", " print('-'*50)\n", " example = get_image(batch_size=batch_size)\n", " filename = f'model_batch_size_{batch_size}.pt'\n", " benchmark(filename, example)\n", " print()" ] } ], "metadata": { "kernelspec": { "display_name": "Python (Neuron PyTorch)", "language": "python", "name": "pytorch_venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.16" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/pytorch/torch-neuronx/t5-inference-tutorial.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# T5 model inference on Trn1 or Inf2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this tutorial we will compile and deploy a pretrained T5 model for accelerated inference on Neuron. \n", "\n", "This tutorial will use the [t5-large](https://huggingface.co/t5-large) model. The T5 model can be used for machine translation, document summarization, question answering, and classification tasks. \n", "\n", "This tutorial has the following main sections:\n", "\n", "1. Install dependencies\n", "1. Compile the T5 model\n", "1. Run inference with greedy decoding on Neuron\n", "1. Run infernece with beam search on Neuron\n", "\n", "This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger.) or Inf2 instance (`inf2.xlarge` or larger.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "The code in this tutorial is written for Jupyter Notebooks. To use Jupyter Notebook on the Neuron instance, you\n", "can use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).\n", "\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuronx`\n", "- `neuronx-cc`\n", "- `transformers`\n", "- `optimum-neuron`\n", "\n", "Most of these packages will be installed when configuring your environment using the Trn1/Inf2 setup guide. The additional dependencies must be installed here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env HF_HUB_DISABLE_PROGRESS_BARS=1 # Avoids xet progress bar model download error\n", "!pip install --upgrade transformers==4.31.0 optimum-neuron==0.0.8 sentencepiece" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators including AWS Trainium and AWS Inferentia. 
It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks. In this tutorial we use 🤗 HuggingFace Optimum Neuron's generate() method instead of 🤗 [transformers's generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) to perform greedy decoding. Optimum Neuron takes care of padding the inputs, which is necessary for inference on Neuron.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile the model into an AWS Neuron optimized TorchScript\n", "\n", "In the following section, we load the T5 model, compile the model's encoder and decoder for Neuron using `torch_neuronx.trace()`, and save the optimized encoder and decoder as `TorchScript`. \n", "\n", "Before we trace the model, we need to make a couple of changes. \n", "\n", "1. We need to write encoder and decoder wrappers - `torch_neuronx` can only trace functions with positional arguments, but the T5 encoder and decoder both use keyword arguments. So, in order to trace them, we have to write wrappers that convert keyword arguments to positional arguments. \n", "2. We modify the T5 code to maximize the computation on the Neuron device - having sections of code running on the CPU will reduce the performance. Moreover, we do not want to move data between the Neuron device and the CPU during inference. The code we trace with `torch_neuronx` is the code that runs on the Neuron device, so we refactor the T5 code to run the computationally heavy operations within the wrappers. \n", "\n", "Let us start with the EncoderWrapper. \n", "\n", "In the HuggingFace T5 implementation, the encoder block takes in the input ids and returns the encoder hidden states. These hidden states are then used to initialize the KV cache in the decoder blocks during the first decoder invocation. We could trace both the encoder and the cache initialization step separately, but there is a better way: we can compute the initial KV cache state within the encoder wrapper. This way, we remove the overhead of moving the hidden states from the Neuron device to the CPU and back. This also allows the Neuron compiler to optimize execution across both the encoder and the cache initialization. \n", "\n", "*Why don't we just initialize the cache on the first decoder run?* \n", "\n", "This is harder to do on Neuron. Similar to `torch.jit.trace()`, `torch_neuronx.trace()` produces a function that has a fixed control flow, i.e. there are no conditional executions. So we cannot choose to conditionally initialize the cache in the first decoder iteration. Instead, we can compute the initial cache state outside the generation flow and pass the cache to it. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "from transformers.models.t5.modeling_t5 import T5Stack, T5LayerCrossAttention\n", "\n", "class EncoderWrapper(torch.nn.Module):\n", "    '''\n", "    We will trace an instance of the EncoderWrapper. \n", "    This wrapper just converts positional args to kwargs. 
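It also precomputes the\n", "    cross-attention KV cache from the encoder output inside the trace, so the\n", "    Neuron compiler can optimize the encoder and the cache initialization together.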
\n", " '''\n", "\n", " def __init__(self, \n", " encoder,\n", " decoder, \n", " model_config, \n", " batch_size, \n", " max_length, \n", " device, \n", " num_beams,\n", " tp_degree=None):\n", " \n", " super().__init__()\n", " self.encoder = encoder\n", " self.decoder = decoder\n", " self.batch_size = batch_size\n", " self.max_length = max_length\n", " self.model_config = model_config\n", " self.device = device\n", " self.num_beams = num_beams\n", " self.num_attention_heads_per_partition = model_config.num_heads\n", " self.tp_degree = tp_degree\n", "\n", " def forward(self, input_ids, attention_mask):\n", " '''\n", " This is the core functionality we want to trace. \n", " '''\n", " encoder_output = self.encoder(input_ids=input_ids,\n", " attention_mask=attention_mask,\n", " output_attentions=False,\n", " output_hidden_states=False)\n", "\n", " last_hidden_state = encoder_output[\"last_hidden_state\"]\n", " encoder_hidden_states = torch.concat([tensor.unsqueeze(0).repeat(self.num_beams, 1, 1) for tensor in last_hidden_state])\n", "\n", " decoder_blocks = self.decoder.block\n", " present_key_value_states_sa = []\n", " present_key_value_states_ca = []\n", "\n", " for i, block in enumerate(decoder_blocks):\n", "\n", " # Cross attention has to be initialized with the encoder hidden state\n", " cross_attention: T5LayerCrossAttention = block.layer[1]\n", " attention = cross_attention.EncDecAttention\n", "\n", " def shape(states):\n", " \"\"\"projection\"\"\"\n", " return states.view(self.batch_size, -1, self.num_attention_heads_per_partition, attention.key_value_proj_dim).transpose(1, 2)\n", "\n", " key_states = shape(attention.k(encoder_hidden_states))\n", " value_states = shape(attention.v(encoder_hidden_states))\n", "\n", " # cross_attn_kv_state\n", " present_key_value_states_ca.append(key_states) \n", " present_key_value_states_ca.append(value_states) \n", " \n", " # Self attention kv states are initialized to zeros. This is done to keep the size of the kv cache tensor constant. \n", " # The kv cache will be an input to the decoder trace. Any traced function will have a fixed control flow. What this means \n", " # is that the trace performs the exact same computations on inputs of the same shape in each invocation. So the attention \n", " # kv cache is padded here to keep a fixed shape. \n", " present_key_value_states_sa.append(torch.zeros((self.batch_size, # key states\n", " self.model_config.num_heads, \n", " self.max_length-1, \n", " self.model_config.d_kv), dtype=torch.float32, device=self.device)) \n", " present_key_value_states_sa.append(torch.zeros((self.batch_size, # value states\n", " self.model_config.num_heads, \n", " self.max_length-1, \n", " self.model_config.d_kv), dtype=torch.float32, device=self.device))\n", "\n", " return present_key_value_states_sa + present_key_value_states_ca\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In the decoder wrapper, in addition to converting keyword arguments to positional arguments we add support for attention caching. Generating text from the encoder decoder models is an autoregressive process. For each invocation, we have to compute the key and value states of the attention heads repeatedly. To improve the performance, we cache the key and value states. This cache is what HuggingFace transformers code refers to as `past_key_values`.\n", "\n", "In HuggingFace transformers, the `past_key_values` are updated outside the decoder. 
This works for training and evaluation, but for inference we want to perform the decoder execution and the cache update within a single trace. This way, we can optimize across both the decoder execution and the cache update. So, we move the cache update within the decoder wrapper." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class DecoderWrapper(torch.nn.Module):\n", "\n", "    def __init__(self, \n", "                 decoder: T5Stack, \n", "                 lm_head: torch.nn.Linear,\n", "                 model_config,\n", "                 num_beams: int, \n", "                 max_length: int,\n", "                 device: str,\n", "                 tp_degree=None):\n", "        super().__init__()\n", "        self.decoder = decoder\n", "        self.lm_head = lm_head\n", "        self.model_dim=model_config.d_model\n", "        self.device = device\n", "        self.num_beams = num_beams\n", "        self.batch_size = 1\n", "        self.config = model_config\n", "\n", "        num_heads=model_config.num_heads\n", "        num_decoder_layers=model_config.num_decoder_layers\n", "\n", "        self.num_attention_heads_per_partition = num_heads\n", "\n", "        # (num_beams, n_heads, seq_length, dim_per_head)\n", "        if device == \"cpu\":\n", "            self.past_key_values_sa = [torch.ones((num_beams,num_heads,max_length-1,model_config.d_kv), dtype=torch.float32) for _ in range(num_decoder_layers * 2)]\n", "            self.past_key_values_ca = [torch.ones((num_beams,num_heads,max_length,model_config.d_kv), dtype=torch.float32) for _ in range(num_decoder_layers * 2)]\n", "        elif device == \"xla\":\n", "            self.past_key_values_sa = torch.nn.ParameterList([torch.nn.Parameter(torch.ones((num_beams,self.num_attention_heads_per_partition,max_length-1,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(num_decoder_layers * 2)])\n", "            self.past_key_values_ca = torch.nn.ParameterList([torch.nn.Parameter(torch.ones((num_beams,self.num_attention_heads_per_partition,max_length,model_config.d_kv), dtype=torch.float32), requires_grad=False) for _ in range(num_decoder_layers * 2)])\n", "\n", "    def update_past(self, past_key_values):\n", "        new_past_sa = []\n", "        new_past_ca = []\n", "        for past_layer in past_key_values:\n", "            new_past_layer = list(past_layer)\n", "            for i in range(len(new_past_layer[:2])):\n", "                new_past_layer[i] = past_layer[i][:, :, 1:]\n", "            new_past_sa += [new_past_layer[:2],]\n", "            new_past_ca += [new_past_layer[2:],]\n", "        return new_past_sa, new_past_ca\n", "\n", "    def reorder_cache(self, past_key_values, beam_idx):\n", "        for i in range(len(past_key_values)):\n", "            gather_index = beam_idx.view([beam_idx.shape[0],1,1,1]).expand_as(past_key_values[i])\n", "            past_key_values[i] = torch.gather(past_key_values[i], dim = 0, index=gather_index)\n", "        return past_key_values\n", "\n", "    def forward(self,\n", "                input_ids,\n", "                decoder_attention_mask,\n", "                encoder_hidden_states,\n", "                encoder_attention_mask,\n", "                beam_idx,\n", "                beam_scores,\n", "                **kwargs):\n", "\n", "        if self.num_beams > 1:\n", "            # We reorder the cache based on the beams selected in each iteration. Required step for beam search.\n", "            past_key_values_sa = self.reorder_cache(self.past_key_values_sa, beam_idx)\n", "            past_key_values_ca = self.reorder_cache(self.past_key_values_ca, beam_idx)\n", "        else:\n", "            # We do not need to reorder for greedy sampling\n", "            past_key_values_sa = self.past_key_values_sa\n", "            past_key_values_ca = self.past_key_values_ca\n", "\n", "        # The cache is stored in a flattened form. We order the cache per layer before passing it to the decoder. \n", "        # Each layer has 4 tensors, so we group by 4. 
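A hypothetical 2-layer illustration:\n", "        #   flattened: [sa_k0, sa_v0, sa_k1, sa_v1] + [ca_k0, ca_v0, ca_k1, ca_v1]\n", "        #   grouped:   [[sa_k0, sa_v0, ca_k0, ca_v0], [sa_k1, sa_v1, ca_k1, ca_v1]]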
\n", " past_key_values = [[*past_key_values_sa[i*2:i*2+2], *past_key_values_ca[i*2:i*2+2]] for i in range(0, int(len(past_key_values_ca)/2))]\n", "\n", " decoder_output = self.decoder(\n", " input_ids=input_ids,\n", " attention_mask=decoder_attention_mask,\n", " past_key_values=past_key_values,\n", " encoder_hidden_states=encoder_hidden_states,\n", " encoder_attention_mask=encoder_attention_mask,\n", " use_cache=True,\n", " output_attentions=False,\n", " output_hidden_states=False)\n", "\n", " last_hidden_state = decoder_output['last_hidden_state']\n", " past_key_values = decoder_output['past_key_values']\n", "\n", " if self.config.tie_word_embeddings:\n", " # Rescale output before projecting on vocab\n", " # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586\n", " last_hidden_state = last_hidden_state * (self.model_dim**-0.5)\n", " \n", " lm_logits = self.lm_head(last_hidden_state)\n", "\n", " past_key_values_sa, past_key_values_ca = self.update_past(past_key_values)\n", "\n", " # We flatten the cache to a single array. This is required for the input output aliasing to work\n", " past_key_values_sa = [vec for kv_per_layer in past_key_values_sa for vec in kv_per_layer]\n", " past_key_values_ca = [vec for kv_per_layer in past_key_values_ca for vec in kv_per_layer]\n", "\n", " if self.device == \"cpu\":\n", " self.past_key_values_sa = past_key_values_sa\n", " self.past_key_values_ca = past_key_values_ca\n", "\n", " # We calculate topk inside the wrapper\n", " next_token_logits = lm_logits[:, -1, :]\n", "\n", " if self.num_beams > 1:\n", " # This section of beam search is run outside the decoder in the huggingface t5 implementation. \n", " # To maximize the computation within the neuron device, we move this within the wrapper\n", " logit_max, _ = torch.max(next_token_logits, dim=-1, keepdim=True)\n", " logsumexp = torch.log(torch.exp(next_token_logits - logit_max).sum(dim=-1, keepdim=True))\n", " next_token_scores = next_token_logits - logit_max - logsumexp\n", " next_token_scores = next_token_scores + beam_scores[:, None].expand_as(next_token_scores)\n", "\n", " # reshape for beam search\n", " vocab_size = next_token_scores.shape[-1]\n", " next_token_scores = next_token_scores.view(self.batch_size, self.num_beams * vocab_size)\n", " next_token_scores = next_token_scores * 1\n", "\n", " # Sample 2 next tokens for each beam (so we have some spare tokens and match output of beam search)\n", " next_token_scores, next_tokens = torch.topk(\n", " next_token_scores, 2 * self.num_beams, dim=1, largest=True, sorted=True\n", " ) \n", "\n", " next_indices = torch.div(next_tokens, vocab_size, rounding_mode=\"floor\")\n", " next_tokens = next_tokens % vocab_size\n", "\n", " return [next_token_scores, next_tokens, next_indices] + past_key_values_sa + past_key_values_ca\n", " else:\n", " # Greedy \n", " next_tokens = torch.argmax(next_token_logits, dim=-1)\n", " return [next_tokens] + past_key_values_sa + past_key_values_ca\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create a T5 model wrapper to make it compatible with our traced encoder and decoder. \n", "\n", "There are two reasons for having this wrapper, \n", "\n", "1. The encoder and decoder traces can only be invoked with positional arguments. But the HuggingFace transformers code is written with keyword arguments. So we override the functions that invoke encoder and decoder to call with positional arguments. \n", "1. 
The generate() function in the NeuronGenerationMixin performs the cache update on the CPU. As we are handling the cache within the DecoderWrapper, we disable the cache update on the CPU. \n", "1. The topK computation to determine the next tokens for beam search was moved into the decoder wrapper. So, we need to override HuggingFace's beam search implementation to accept the next tokens and the beam scores from the decoder. \n", "\n", "Let's also override the `generate()` function so that it will initialize the cache using the cache initializer before starting the greedy decoding." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_xla.core.xla_model as xm\n", "\n", "from transformers import T5Tokenizer, T5ForConditionalGeneration\n", "from transformers.modeling_outputs import BaseModelOutput, Seq2SeqLMOutput\n", "from transformers.models.t5.modeling_t5 import T5Stack, T5LayerCrossAttention\n", "from transformers.generation.utils import ModelOutput\n", "from typing import Any, Dict, List, Optional, Tuple, Union\n", "from transformers.generation.beam_search import BeamScorer, BeamSearchScorer\n", "\n", "from optimum.neuron.generation import NeuronGenerationMixin\n", "\n", "from transformers.generation.logits_process import (\n", "    LogitsProcessorList,\n", ")\n", "from transformers.generation.stopping_criteria import (\n", "    MaxLengthCriteria,\n", "    MaxTimeCriteria,\n", "    StoppingCriteriaList,\n", "    validate_stopping_criteria,\n", ")\n", "\n", "from transformers.generation.utils import (\n", "    BeamSearchOutput,\n", "    GreedySearchOutput,\n", ")\n", "\n", "class T5Wrapper(T5ForConditionalGeneration, NeuronGenerationMixin):\n", "\n", "    def _prepare_encoder_decoder_kwargs_for_generation(\n", "        self, \n", "        inputs_tensor: torch.Tensor, \n", "        model_kwargs, \n", "        model_input_name: Optional[str] = None\n", "    ) -> Dict[str, Any]:\n", "        encoder = self.get_encoder()\n", "        model_kwargs[\"encoder_outputs\"]: ModelOutput = encoder(inputs_tensor, model_kwargs[\"attention_mask\"])\n", "        return model_kwargs\n", "\n", "    # Override to cut the input_ids to just the last token\n", "    def prepare_inputs_for_generation(\n", "        self,\n", "        input_ids,\n", "        past_key_values=None,\n", "        attention_mask=None,\n", "        head_mask=None,\n", "        decoder_head_mask=None,\n", "        decoder_attention_mask=None,\n", "        cross_attn_head_mask=None,\n", "        use_cache=None,\n", "        encoder_outputs=None,\n", "        **kwargs,\n", "    ):\n", "        # cut decoder_input_ids as past is cached\n", "        input_ids = input_ids[:, -1:]\n", "\n", "        return {\n", "            \"decoder_input_ids\": input_ids,\n", "            \"past_key_values\": past_key_values,\n", "            \"encoder_outputs\": encoder_outputs,\n", "            \"attention_mask\": attention_mask,\n", "            \"head_mask\": head_mask,\n", "            \"decoder_head_mask\": decoder_head_mask,\n", "            \"decoder_attention_mask\": decoder_attention_mask,\n", "            \"cross_attn_head_mask\": cross_attn_head_mask,\n", "            \"use_cache\": use_cache,\n", "        }\n", "\n", "    '''\n", "    We update the cache in the decoder trace, so let's override the _update_model_kwargs_for_xla_generation in NeuronGenerationMixin\n", "    '''\n", "    def _update_model_kwargs_for_xla_generation(\n", "        self,\n", "        model_kwargs: Dict[str, Any],\n", "        batch_size: int,\n", "        is_encoder_decoder: bool = False,\n", "        standardize_cache_format: bool = False,\n", "        max_length: Optional[int] = None,\n", "        seq_length: Optional[int] = None,\n", "        use_cache: bool = True,\n", "    ) -> Dict[str, Any]:\n", "\n", "        def _update_attention(model_kwargs, 
is_encoder_decoder):\n", " \"\"\"Updates the appropriate attention mask -- encoder-decoder models use `decoder_attention_mask`\"\"\"\n", "\n", " attention_mask_name = \"decoder_attention_mask\" if is_encoder_decoder else \"attention_mask\"\n", " attention_mask = model_kwargs.pop(attention_mask_name)\n", " attention_mask_update_slice = torch.ones(\n", " (batch_size, 1), dtype=attention_mask.dtype, device=attention_mask.device\n", " )\n", " attention_mask = torch.cat([attention_mask[:, 1:], attention_mask_update_slice], dim=-1)\n", " mask = {attention_mask_name: attention_mask}\n", " return mask\n", "\n", " mask = _update_attention(model_kwargs, is_encoder_decoder)\n", " # sets the updated variables (mask and past_key_values)\n", " model_kwargs.update(mask)\n", "\n", " # Set a mock cache tensor\n", " model_kwargs[\"past_key_values\"] = torch.tensor([])\n", "\n", " return model_kwargs\n", " \n", " def _reorder_cache(self, past_key_values, beam_idx):\n", " '''\n", " This is needed for beam search and not greedy sampling\n", " We reorder the cache within the trace so we can skip it in modelling_t5.py. So we override the _reorder_cache\n", " '''\n", " self.beam_idx = beam_idx\n", " return past_key_values\n", "\n", " def generate(self,\n", " tokenizer: T5Tokenizer,\n", " prompt: str,\n", " max_length: int,\n", " num_beams: int,\n", " num_return_sequences: int,\n", " device: str):\n", "\n", " batch_encoding = tokenizer(prompt, max_length=max_length, truncation=True, padding='max_length',\n", " return_tensors=\"pt\")\n", "\n", " past_key_values = self.encoder(batch_encoding['input_ids'],batch_encoding['attention_mask'])\n", " \n", " decoder_attention_mask = torch.cat([torch.zeros((1, max_length-1), dtype=torch.int32),\n", " torch.ones((1, 1), dtype=torch.int32)], axis=1)\n", "\n", " # copy the new cache state to the decoder\n", " if device == \"xla\":\n", " for state, tensor in zip(self.decoder.parameters(), past_key_values):\n", " state.copy_(tensor)\n", " else:\n", " # First half of the cache is self attention and the rest is cross attention\n", " self.decoder.past_key_values_sa = past_key_values[:len(past_key_values)//2]\n", " self.decoder.past_key_values_ca = past_key_values[len(past_key_values)//2:]\n", " \n", " output = super().generate(**batch_encoding,\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences,\n", " do_sample=False,\n", " use_cache=True,\n", " decoder_attention_mask=decoder_attention_mask, \n", " encoder_outputs={\"last_hidden_state\": torch.ones((1,128,1))}) # Pass fake encoder_outputs so the transfomers code will not invoke the encoder\n", " return output\n", "\n", " def forward(\n", " self,\n", " attention_mask: Optional[torch.FloatTensor] = None,\n", " decoder_input_ids: Optional[torch.LongTensor] = None,\n", " decoder_attention_mask: Optional[torch.BoolTensor] = None,\n", " encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None,\n", " beam_scores = None,\n", " **kwargs\n", " ) -> Union[Tuple[torch.FloatTensor], Seq2SeqLMOutput]:\n", "\n", " hidden_states = encoder_outputs[\"last_hidden_state\"]\n", "\n", " if not hasattr(self, 'beam_idx'):\n", " # Infering the number of beams from the attention mask\n", " num_beams = attention_mask.shape[0]\n", " self.beam_idx = torch.arange(0, num_beams, dtype=torch.int64)\n", "\n", " decoder_outputs = self.decoder(\n", " decoder_input_ids,\n", " decoder_attention_mask,\n", " hidden_states,\n", " attention_mask,\n", " self.beam_idx,\n", " beam_scores\n", " )\n", "\n", " # lm_logits = 
decoder_outputs[0]\n", " next_token_scores = decoder_outputs[0]\n", " next_tokens = decoder_outputs[1]\n", " next_indices = decoder_outputs[2]\n", "\n", " return next_token_scores, next_tokens, next_indices\n", "\n", " def beam_search(\n", " self,\n", " input_ids: torch.LongTensor,\n", " beam_scorer: BeamScorer,\n", " logits_processor: Optional[LogitsProcessorList] = None,\n", " stopping_criteria: Optional[StoppingCriteriaList] = None,\n", " max_length: Optional[int] = None,\n", " pad_token_id: Optional[int] = None,\n", " eos_token_id: Optional[Union[int, List[int]]] = None,\n", " output_attentions: Optional[bool] = None,\n", " output_hidden_states: Optional[bool] = None,\n", " output_scores: Optional[bool] = None,\n", " return_dict_in_generate: Optional[bool] = None,\n", " synced_gpus: Optional[bool] = False,\n", " seq_length: Optional[int] = None,\n", " **model_kwargs,\n", " ) -> Union[BeamSearchOutput, torch.LongTensor]:\n", "\n", " logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()\n", " stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()\n", " pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id\n", " eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id\n", " if isinstance(eos_token_id, int):\n", " eos_token_id = [eos_token_id]\n", " output_scores = output_scores if output_scores is not None else self.generation_config.output_scores\n", " output_attentions = (\n", " output_attentions if output_attentions is not None else self.generation_config.output_attentions\n", " )\n", " output_hidden_states = (\n", " output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states\n", " )\n", "\n", " batch_size = len(beam_scorer._beam_hyps)\n", " num_beams = beam_scorer.num_beams\n", "\n", " batch_beam_size, cur_len = input_ids.shape\n", "\n", " # Overwrite cur_len\n", " cur_len = seq_length\n", "\n", " if num_beams * batch_size != batch_beam_size:\n", " raise ValueError(\n", " f\"Batch dimension of `input_ids` should be {num_beams * batch_size}, but is {batch_beam_size}.\"\n", " )\n", "\n", " # init attention / hidden states / scores tuples\n", " scores = () if (return_dict_in_generate and output_scores) else None\n", " beam_indices = (\n", " tuple(() for _ in range(batch_beam_size)) if (return_dict_in_generate and output_scores) else None\n", " )\n", "\n", " # initialise score of first beam with 0 and the rest with -1e9. 
This makes sure that only tokens\n", " # of the first beam are considered to avoid sampling the exact same tokens across all beams.\n", " # beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)\n", " beam_scores_device = \"cpu\"\n", " beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=beam_scores_device)\n", " beam_scores[:, 1:] = -1e9\n", " beam_scores = beam_scores.view((batch_size * num_beams,))\n", "\n", " while True:\n", " # prepare model inputs\n", " # From max_length-sized input_ids, select first\n", " # cur_len - 1 values.\n", " update_indices = torch.stack(\n", " [torch.arange(input_ids.size(0)), torch.tensor(cur_len - 1).repeat(input_ids.size(0))], dim=-1\n", " )\n", " input_ids_ = input_ids[update_indices[:, 0], update_indices[:, 1], None]\n", " model_inputs = self.prepare_inputs_for_generation(input_ids_, **model_kwargs)\n", "\n", " next_token_scores, next_tokens, next_indices = self(\n", " **model_inputs,\n", " return_dict=True,\n", " output_attentions=output_attentions,\n", " output_hidden_states=output_hidden_states,\n", " beam_scores=beam_scores\n", " )\n", "\n", " # stateless\n", " beam_outputs = beam_scorer.process(\n", " input_ids.to(\"cpu\")[:, :cur_len],\n", " next_token_scores.to(\"cpu\"),\n", " next_tokens.to(\"cpu\"),\n", " next_indices.to(\"cpu\"),\n", " pad_token_id=pad_token_id,\n", " eos_token_id=eos_token_id,\n", " beam_indices=beam_indices,\n", " )\n", "\n", " beam_scores = beam_outputs[\"next_beam_scores\"]\n", " beam_next_tokens = beam_outputs[\"next_beam_tokens\"]\n", " beam_idx = beam_outputs[\"next_beam_indices\"]\n", "\n", " update_indices = torch.stack(\n", " [torch.arange(batch_beam_size), torch.tensor(cur_len - 1).repeat(batch_beam_size)], dim=-1\n", " )\n", " update_indices_2 = torch.stack(\n", " [torch.arange(batch_beam_size), torch.tensor(cur_len).repeat(batch_beam_size)], dim=-1\n", " )\n", " # First select beam_indices\n", " device = input_ids.device\n", " beam_idx_device = beam_idx.to(device=input_ids.device)\n", " input_ids[:, :] = input_ids[beam_idx_device.long(), :]\n", "\n", " # Then append new tokens\n", " input_ids[update_indices_2[:, 0], update_indices_2[:, 1], None] = beam_next_tokens.unsqueeze(-1).to(device).to(torch.long)\n", " input_ids = input_ids * 1 # Hack to materialize tensor\n", "\n", " # update generated ids, model inputs, and length for next step\n", " model_kwargs = self._update_model_kwargs_for_xla_generation(\n", " model_kwargs,\n", " batch_size=batch_beam_size,\n", " is_encoder_decoder=self.config.is_encoder_decoder,\n", " max_length=stopping_criteria.max_length,\n", " seq_length=cur_len,\n", " use_cache=model_kwargs[\"use_cache\"],\n", " )\n", " if model_kwargs[\"past_key_values\"] is not None:\n", " model_kwargs[\"past_key_values\"] = self._reorder_cache(model_kwargs[\"past_key_values\"], beam_idx.to(torch.int64))\n", "\n", " if return_dict_in_generate and output_scores:\n", " beam_indices = tuple((beam_indices[beam_idx[i]] + (beam_idx[i],) for i in range(len(beam_indices))))\n", "\n", " # increase cur_len\n", " cur_len = cur_len + 1\n", "\n", " # stop when each sentence is finished, or if we exceed the maximum length\n", " stop_criterion_1 = beam_scorer.is_done\n", " if isinstance(stopping_criteria, list):\n", " if len(stopping_criteria) == 1:\n", " stopping_criteria = stopping_criteria[0]\n", "\n", " # Cases that can be handled in XLA without requiring\n", " # non-padded input_ids\n", " if isinstance(stopping_criteria, MaxLengthCriteria):\n", " 
stop_criterion_2 = cur_len >= stopping_criteria.max_length\n", " elif isinstance(stopping_criteria, MaxTimeCriteria):\n", " stop_criterion_2 = stopping_criteria(input_ids, scores)\n", " else:\n", " # Other cases will be handled on CPU\n", " batch_size, _ = input_ids.shape\n", " input_ids_cpu = input_ids.to(\"cpu\")\n", " mask = torch.cat(\n", " [torch.ones(batch_size, cur_len), torch.zeros(batch_size, input_ids.shape[1] - cur_len)], dim=1\n", " ).bool()\n", " input_ids_cpu = torch.masked_select(input_ids_cpu, mask).reshape((batch_size, cur_len))\n", " scores_cpu = scores.to(\"cpu\") if torch.is_tensor(scores) else scores\n", " stop_criterion_2 = stopping_criteria(input_ids_cpu, scores_cpu)\n", "\n", " if stop_criterion_1 or stop_criterion_2:\n", " if not synced_gpus:\n", " break\n", " else:\n", " this_peer_finished = True\n", "\n", " sequence_outputs = beam_scorer.finalize(\n", " input_ids.to(\"cpu\"),\n", " beam_scores.to(\"cpu\"),\n", " next_tokens.to(\"cpu\"),\n", " next_indices.to(\"cpu\"),\n", " pad_token_id=pad_token_id,\n", " eos_token_id=eos_token_id,\n", " max_length=stopping_criteria.max_length,\n", " beam_indices=beam_indices,\n", " )\n", "\n", " for k, v in sequence_outputs.items():\n", " if type(v) == torch.Tensor:\n", " sequence_outputs[k] = sequence_outputs[k].to(input_ids.device)\n", "\n", " return sequence_outputs[\"sequences\"]\n", "\n", "\n", " def greedy_search(\n", " self,\n", " input_ids: torch.LongTensor,\n", " logits_processor: Optional[LogitsProcessorList] = None,\n", " stopping_criteria: Optional[StoppingCriteriaList] = None,\n", " max_length: Optional[int] = None,\n", " pad_token_id: Optional[int] = None,\n", " eos_token_id: Optional[Union[int, List[int]]] = None,\n", " output_attentions: Optional[bool] = None,\n", " output_hidden_states: Optional[bool] = None,\n", " output_scores: Optional[bool] = None,\n", " return_dict_in_generate: Optional[bool] = None,\n", " seq_length: Optional[int] = int,\n", " streamer: Optional[\"BaseStreamer\"] = None,\n", " **model_kwargs,\n", " ) -> Union[GreedySearchOutput, torch.LongTensor]:\n", " \"\"\"\n", " Overriding greedy sampling to use next tokens returned from neuron device instead of logits.\n", " \"\"\"\n", " # init values\n", " logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()\n", " use_cache = model_kwargs[\"use_cache\"] if \"use_cache\" in model_kwargs else False\n", " stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()\n", " pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id\n", " eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id\n", " if isinstance(eos_token_id, int):\n", " eos_token_id = [eos_token_id]\n", " eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None\n", " output_scores = output_scores if output_scores is not None else self.generation_config.output_scores\n", " output_attentions = (\n", " output_attentions if output_attentions is not None else self.generation_config.output_attentions\n", " )\n", " output_hidden_states = (\n", " output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states\n", " )\n", "\n", " # init attention / hidden states / scores tuples\n", " scores = () if (return_dict_in_generate and output_scores) else None\n", " decoder_attentions = () if (return_dict_in_generate and output_attentions) else None\n", " 
cross_attentions = () if (return_dict_in_generate and output_attentions) else None\n", " decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None\n", "\n", "\n", " # keep track of which sequences are already finished\n", " unfinished_sequences = torch.ones(input_ids.shape[0], dtype=torch.long, device=input_ids.device)\n", "\n", " this_peer_finished = False # used by synced_gpus only\n", " while True:\n", "\n", " # prepare model inputs\n", " # From max_length-sized input_ids, select first\n", " # seq_length - 1 values.\n", "\n", " if model_kwargs.get(\"past_key_values\") is None:\n", " input_ids_ = input_ids[:, :seq_length]\n", " else:\n", " update_indices = torch.stack(\n", " [torch.arange(input_ids.size(0)), torch.tensor(seq_length - 1).repeat(input_ids.size(0))],\n", " dim=-1,\n", " )\n", " input_ids_ = input_ids[update_indices[:, 0], update_indices[:, 1], None]\n", "\n", " model_inputs = self.prepare_inputs_for_generation(input_ids_, **model_kwargs)\n", " \n", " # forward pass to get next token\n", " output = self(\n", " **model_inputs,\n", " return_dict=True,\n", " output_attentions=output_attentions,\n", " output_hidden_states=output_hidden_states,\n", " )\n", " next_tokens = output[0]\n", "\n", " # finished sentences should have their next token be a padding token\n", " if eos_token_id is not None:\n", " if pad_token_id is None:\n", " raise ValueError(\"If `eos_token_id` is defined, make sure that `pad_token_id` is defined.\")\n", " next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)\n", "\n", " # update generated ids, model inputs, and length for next step\n", "\n", " batch_size, _ = input_ids.shape\n", " update_indices = torch.stack(\n", " [torch.arange(batch_size), torch.tensor(seq_length).repeat(batch_size)], dim=-1\n", " )\n", " input_ids[update_indices[:, 0], update_indices[:, 1]] = next_tokens[:]\n", " model_kwargs = self._update_model_kwargs_for_xla_generation(\n", " model_kwargs,\n", " batch_size=batch_size,\n", " is_encoder_decoder=self.config.is_encoder_decoder,\n", " max_length=stopping_criteria.max_length,\n", " seq_length=seq_length,\n", " use_cache=use_cache,\n", " )\n", "\n", " seq_length += 1\n", "\n", " # if eos_token was found in one sentence, set sentence to finished\n", " if eos_token_id_tensor is not None:\n", " unfinished_sequences = unfinished_sequences.mul(\n", " next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)\n", " )\n", "\n", " # stop when each sentence is finished, or if we exceed the maximum length\n", " stop_criterion_1 = unfinished_sequences.max() == 0\n", "\n", " if isinstance(stopping_criteria, list):\n", " if len(stopping_criteria) == 1:\n", " stopping_criteria = stopping_criteria[0]\n", "\n", " # Cases that can be handled in XLA without requiring\n", " # non-padded input_ids\n", " if isinstance(stopping_criteria, MaxLengthCriteria):\n", " stop_criterion_2 = seq_length >= stopping_criteria.max_length\n", " elif isinstance(stopping_criteria, MaxTimeCriteria):\n", " stop_criterion_2 = stopping_criteria(input_ids, scores)\n", " else:\n", " # Other cases will be handled on CPU\n", " batch_size, _ = input_ids.shape\n", " mask = torch.cat(\n", " [torch.ones(batch_size, seq_length), torch.zeros(batch_size, input_ids.shape[1] - seq_length)],\n", " dim=1,\n", " ).bool()\n", " input_ids_cpu = torch.masked_select(input_ids, mask).reshape((batch_size, seq_length)).to(\"cpu\")\n", " scores_cpu = scores.to(\"cpu\") if 
torch.is_tensor(scores) else scores\n", " stop_criterion_2 = stopping_criteria(input_ids_cpu, scores_cpu)\n", "\n", " if stop_criterion_1 or stop_criterion_2:\n", " this_peer_finished = True\n", "\n", " if this_peer_finished:\n", " break\n", "\n", " if streamer is not None:\n", " streamer.end()\n", "\n", " return input_ids\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's test inference on CPU with all the wrappers before tracing." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Let's set some run parameters\n", "\n", "model_name = \"t5-large\"\n", "num_beams = 1\n", "num_return_sequences = 1\n", "max_length = 128" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results:\n", "1 Lassen Sie uns gutes Essen essen.\n" ] } ], "source": [ "from transformers import T5Tokenizer\n", "\n", "\n", "prompt=\"translate English to German: Lets eat good food.\"\n", " \n", "tokenizer = T5Tokenizer.from_pretrained(model_name, model_max_length=max_length)\n", "model = T5Wrapper.from_pretrained(model_name)\n", "\n", "model.encoder = EncoderWrapper(model.encoder, model.decoder, model.config, num_beams, max_length, \"cpu\", num_beams)\n", "setattr(model.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", "\n", "model.decoder = DecoderWrapper(decoder=model.decoder,\n", " lm_head=model.lm_head,\n", " model_config=model.config,\n", " num_beams=num_beams,\n", " max_length=max_length,\n", " device=\"cpu\")\n", "\n", "output = model.generate(tokenizer=tokenizer,\n", " prompt=prompt,\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences,\n", " device=\"cpu\")\n", "\n", "results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", "print('Results:')\n", "for i, summary in enumerate(results):\n", " print(i + 1, summary)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the wrappers are running as expected, let's trace the encoder and decoder. To trace these modules, we pass each module and a sample input to the trace function. The result of the trace stage will be a static executable where the operations to be run upon inference are determined during compilation. This means that when inferring, the resulting Neuron model must be executed with tensors that are the exact same shape as those provided at compilation time. If a model is given a tensor at inference time whose shape does not match the tensor given at compilation time, an error will occur.\n", "\n", "The decoder wrapper returns the new state of the cache as an output which is copied back to the CPU. As the cache is a large tensor, copying it to and from the XLA device for each decoder invocation will significantly slow down the inference. Instead, we can use input output aliasing, a feature of `torch_neuronx`, to keep these tensors on device rather than copying back to the CPU. To use input output aliasing, we need to map the outputs to input parameters while tracing. 
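", "\n", "As a rough sketch (hypothetical module and tensor names, not the exact call used below), the aliasing dictionary maps a tensor held by the module to the positional index of the traced output that overwrites it:\n", "\n", "```python\n", "# Output index 1 of the traced forward pass overwrites module.cache in place,\n", "# so the updated cache stays on the Neuron device between invocations.\n", "traced = torch_neuronx.trace(module, example_inputs, input_output_aliases={module.cache: 1})\n", "```\n", "\n", "The cell below builds this kind of dictionary for every self-attention and cross-attention cache tensor of the decoder. 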
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuronx\n", "\n", "from transformers import T5Tokenizer, T5ForConditionalGeneration\n", "\n", "def trace_encoder(model: T5ForConditionalGeneration,\n", " tokenizer: T5Tokenizer,\n", " max_length: int,\n", " num_beams: int):\n", " \n", " # Trace encoder\n", " batch_encoding = tokenizer(\"translate English to German: Lets go home now\",\n", " max_length=max_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " input_ids = batch_encoding['input_ids']\n", " attention_mask = batch_encoding['attention_mask']\n", "\n", " encoder = EncoderWrapper(model.encoder, model.decoder, model.config, num_beams, max_length, \"xla\", num_beams)\n", " traced_encoder = torch_neuronx.trace(encoder, (input_ids, attention_mask), compiler_workdir=\"/tmp/encoder/\")\n", " setattr(traced_encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", "\n", " return traced_encoder\n", "\n", "def trace_decoder(model: T5ForConditionalGeneration,\n", " num_beams: int,\n", " max_length: int):\n", "\n", " decoder = DecoderWrapper(decoder=model.decoder,\n", " lm_head=model.lm_head,\n", " model_config=model.config,\n", " num_beams=num_beams,\n", " max_length=max_length,\n", " device=\"xla\")\n", "\n", " # We create mock inputs so we can trace the decoder\n", " decoder_input_ids = torch.ones((num_beams, 1), dtype=torch.int64)\n", " decoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int32)\n", " encoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int64)\n", " encoder_hidden_states = torch.ones((num_beams, max_length, model.config.d_model), dtype=torch.float32)\n", "\n", " beam_idx = torch.arange(0, num_beams, dtype=torch.int64)\n", " beam_scores = torch.zeros((num_beams,), dtype=torch.float)\n", "\n", " num_outputs_from_trace = 3 if num_beams > 1 else 1\n", "\n", " aliases = {}\n", " for i in range(len(decoder.past_key_values_sa)):\n", " aliases[decoder.past_key_values_sa[i]] = i + num_outputs_from_trace\n", " for i in range(len(decoder.past_key_values_ca)):\n", " aliases[decoder.past_key_values_ca[i]] = len(decoder.past_key_values_sa) + i + num_outputs_from_trace\n", "\n", " traced_decoder = torch_neuronx.trace(decoder, (\n", " decoder_input_ids,\n", " decoder_attention_mask,\n", " encoder_hidden_states,\n", " encoder_attention_mask,\n", " beam_idx,\n", " beam_scores,\n", " ), input_output_aliases=aliases, compiler_workdir=\"/tmp/decoder/\")\n", "\n", " return traced_decoder\n", "\n", "\n", "tokenizer = T5Tokenizer.from_pretrained(model_name, model_max_length=max_length)\n", "model = T5ForConditionalGeneration.from_pretrained(model_name)\n", "\n", "# We enable this flag to ensure model uses attention key value caching\n", "model.config.use_cache = True\n", "\n", "traced_encoder = trace_encoder(model, tokenizer, max_length, num_beams)\n", "traced_decoder = trace_decoder(model, num_beams, max_length)\n", "\n", "torch.jit.save(traced_encoder, \"TracedEncoder.pt\")\n", "torch.jit.save(traced_decoder, \"TracedDecoder.pt\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference with greedy decoding\n", "Now that we have the traced model, let's use it for inference. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results:\n", "1 Lassen Sie uns gutes Essen essen.\n" ] } ], "source": [ "runtime = torch.classes.neuron.Runtime()\n", "runtime.initialize()\n", "runtime.set_default_neuron_cores(0, 1)\n", "\n", "tokenizer = T5Tokenizer.from_pretrained(model_name)\n", "model = T5Wrapper.from_pretrained(model_name)\n", "\n", "model.encoder = torch.jit.load(\"TracedEncoder.pt\")\n", "# Attribute required by beam search\n", "setattr(model.encoder, 'main_input_name', 'input_ids') \n", "\n", "model.decoder = torch.jit.load(\"TracedDecoder.pt\")\n", "torch_neuronx.move_trace_to_device(model.decoder, 0)\n", "\n", "\n", "output = model.generate(tokenizer=tokenizer,\n", " prompt=\"translate English to German: Lets eat good food.\",\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences,\n", " device=\"xla\")\n", "\n", "results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", "print('Results:')\n", "for i, summary in enumerate(results):\n", " print(i + 1, summary)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference with beam search" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's set some run parameters for beam search\n", "\n", "model_name = \"t5-large\"\n", "num_beams = 4\n", "num_return_sequences = 4\n", "max_length = 128\n", "\n", "tokenizer = T5Tokenizer.from_pretrained(model_name, model_max_length=max_length)\n", "model = T5ForConditionalGeneration.from_pretrained(model_name)\n", "model.config.use_cache = True\n", "\n", "traced_encoder = trace_encoder(model, tokenizer, max_length, num_beams)\n", "traced_decoder = trace_decoder(model, num_beams, max_length)\n", "\n", "torch.jit.save(traced_encoder, \"TracedEncoder.pt\")\n", "torch.jit.save(traced_decoder, \"TracedDecoder.pt\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results:\n", "1 Lassen Sie uns gutes Essen essen.\n", "2 Lassen Sie uns gutes Essen zu essen.\n", "3 Lassen Sie uns essen gutes Essen.\n", "4 Lassen Sie uns gutes Essen.\n" ] } ], "source": [ "tokenizer = T5Tokenizer.from_pretrained(model_name)\n", "model = T5Wrapper.from_pretrained(model_name)\n", "\n", "model.encoder = torch.jit.load(\"TracedEncoder.pt\")\n", "# Attribute required by beam search\n", "setattr(model.encoder, 'main_input_name', 'input_ids') \n", "\n", "model.decoder = torch.jit.load(\"TracedDecoder.pt\")\n", "torch_neuronx.move_trace_to_device(model.decoder, 0)\n", "\n", "\n", "output = model.generate(tokenizer=tokenizer,\n", " prompt=\"translate English to German: Lets eat good food.\",\n", " max_length=max_length,\n", " num_beams=num_beams,\n", " num_return_sequences=num_return_sequences,\n", " device=\"xla\")\n", "\n", "results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", "print('Results:')\n", "for i, summary in enumerate(results):\n", " print(i + 1, summary)" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 } 
================================================ FILE: src/examples/pytorch/torchserve/benchmark_bert.py ================================================ import os import argparse import time import numpy as np import requests import sys from concurrent import futures import torch parser = argparse.ArgumentParser() parser.add_argument('--url', help='Torchserve model URL', type=str, default=f'http://127.0.0.1:8080/predictions/bert-max_length128-batch_size6') parser.add_argument('--num_thread', type=int, default=64, help='Number of threads invoking the model URL') parser.add_argument('--batch_size', type=int, default=6) parser.add_argument('--sequence_length', type=int, default=128) parser.add_argument('--latency_window_size', type=int, default=1000) parser.add_argument('--throughput_time', type=int, default=300) parser.add_argument('--throughput_interval', type=int, default=10) args = parser.parse_args() data = { 'seq_0': 'A completely made up sentence.', 'seq_1': 'Well, I suppose they are all made up.' } live = True num_infer = 0 latency_list = [] def one_thread(pred, feed_data): global latency_list global num_infer global live session = requests.Session() while True: start = time.time() result = session.post(pred, data=feed_data) latency = time.time() - start latency_list.append(latency) num_infer += 1 if not live: break def current_performance(): last_num_infer = num_infer for _ in range(args.throughput_time // args.throughput_interval): current_num_infer = num_infer throughput = (current_num_infer - last_num_infer) / args.throughput_interval p50 = 0.0 p90 = 0.0 if latency_list: p50 = np.percentile(latency_list[-args.latency_window_size:], 50) p90 = np.percentile(latency_list[-args.latency_window_size:], 90) print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90)) sys.stdout.flush() last_num_infer = current_num_infer time.sleep(args.throughput_interval) global live live = False with futures.ThreadPoolExecutor(max_workers=args.num_thread+1) as executor: executor.submit(current_performance) for _ in range(args.num_thread): executor.submit(one_thread, args.url, data) ================================================ FILE: src/examples/pytorch/torchserve/config.json ================================================ { "model_name": "bert-base-cased-finetuned-mrpc", "max_length": 128, "batch_size": 6 } ================================================ FILE: src/examples/pytorch/torchserve/handler_bert.py ================================================ import os import json import sys import logging from abc import ABC import torch import torch_neuron from transformers import AutoTokenizer from ts.torch_handler.base_handler import BaseHandler # one core per worker os.environ['NEURON_RT_NUM_CORES'] = '1' logger = logging.getLogger(__name__) class BertEmbeddingHandler(BaseHandler, ABC): """ Handler class for Bert Embedding computations. 
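Tokenizes paired text inputs, pads partial batches up to the compiled batch size, and runs the pre-compiled TorchScript model on a single NeuronCore per worker.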
""" def __init__(self): super(BertEmbeddingHandler, self).__init__() self.initialized = False def initialize(self, ctx): self.manifest = ctx.manifest properties = ctx.system_properties self.device = 'cpu' model_dir = properties.get('model_dir') serialized_file = self.manifest['model']['serializedFile'] model_pt_path = os.path.join(model_dir, serialized_file) # point sys.path to our config file with open('config.json') as fp: config = json.load(fp) self.max_length = config['max_length'] self.batch_size = config['batch_size'] self.classes = ['not paraphrase', 'paraphrase'] self.model = torch.jit.load(model_pt_path) logger.debug(f'Model loaded from {model_dir}') self.model.to(self.device) self.model.eval() self.tokenizer = AutoTokenizer.from_pretrained(config['model_name']) self.initialized = True def preprocess(self, input_data): """ Tokenization pre-processing """ input_ids = [] attention_masks = [] token_type_ids = [] for row in input_data: seq_0 = row['seq_0'].decode('utf-8') seq_1 = row['seq_1'].decode('utf-8') logger.debug(f'Received text: "{seq_0}", "{seq_1}"') inputs = self.tokenizer.encode_plus( seq_0, seq_1, max_length=self.max_length, padding='max_length', truncation=True, return_tensors='pt' ) input_ids.append(inputs['input_ids']) attention_masks.append(inputs['attention_mask']) token_type_ids.append(inputs['token_type_ids']) batch = (torch.cat(input_ids, 0), torch.cat(attention_masks, 0), torch.cat(token_type_ids, 0)) return batch def inference(self, inputs): """ Predict the class of a text using a trained transformer model. """ # sanity check dimensions assert(len(inputs) == 3) num_inferences = len(inputs[0]) assert(num_inferences <= self.batch_size) # insert padding if we received a partial batch padding = self.batch_size - num_inferences if padding > 0: pad = torch.nn.ConstantPad1d((0, 0, 0, padding), value=0) inputs = [pad(x) for x in inputs] outputs = self.model(*inputs)[0] predictions = [] for i in range(num_inferences): prediction = self.classes[outputs[i].argmax().item()] predictions.append([prediction]) logger.debug("Model predicted: '%s'", prediction) return predictions def postprocess(self, inference_output): return inference_output ================================================ FILE: src/examples/pytorch/torchserve/handler_bert_neuronx.py ================================================ import os import json import sys import logging from abc import ABC import torch import torch_neuronx from transformers import AutoTokenizer from ts.torch_handler.base_handler import BaseHandler # one core per worker os.environ['NEURON_RT_NUM_CORES'] = '1' logger = logging.getLogger(__name__) class BertEmbeddingHandler(BaseHandler, ABC): """ Handler class for Bert Embedding computations. 
""" def __init__(self): super(BertEmbeddingHandler, self).__init__() self.initialized = False def initialize(self, ctx): self.manifest = ctx.manifest properties = ctx.system_properties self.device = 'cpu' model_dir = properties.get('model_dir') serialized_file = self.manifest['model']['serializedFile'] model_pt_path = os.path.join(model_dir, serialized_file) # point sys.path to our config file with open('config.json') as fp: config = json.load(fp) self.max_length = config['max_length'] self.batch_size = config['batch_size'] self.classes = ['not paraphrase', 'paraphrase'] self.model = torch.jit.load(model_pt_path) logger.debug(f'Model loaded from {model_dir}') self.model.to(self.device) self.model.eval() self.tokenizer = AutoTokenizer.from_pretrained(config['model_name']) self.initialized = True def preprocess(self, input_data): """ Tokenization pre-processing """ input_ids = [] attention_masks = [] token_type_ids = [] for row in input_data: seq_0 = row['seq_0'].decode('utf-8') seq_1 = row['seq_1'].decode('utf-8') logger.debug(f'Received text: "{seq_0}", "{seq_1}"') inputs = self.tokenizer.encode_plus( seq_0, seq_1, max_length=self.max_length, padding='max_length', truncation=True, return_tensors='pt' ) input_ids.append(inputs['input_ids']) attention_masks.append(inputs['attention_mask']) token_type_ids.append(inputs['token_type_ids']) batch = (torch.cat(input_ids, 0), torch.cat(attention_masks, 0), torch.cat(token_type_ids, 0)) return batch def inference(self, inputs): """ Predict the class of a text using a trained transformer model. """ # sanity check dimensions assert(len(inputs) == 3) num_inferences = len(inputs[0]) assert(num_inferences <= self.batch_size) # insert padding if we received a partial batch padding = self.batch_size - num_inferences if padding > 0: pad = torch.nn.ConstantPad1d((0, 0, 0, padding), value=0) inputs = [pad(x) for x in inputs] outputs = self.model(*inputs)[0] predictions = [] for i in range(num_inferences): prediction = self.classes[outputs[i].argmax(dim=-1).item()] predictions.append([prediction]) logger.debug("Model predicted: '%s'", prediction) return predictions def postprocess(self, inference_output): return inference_output ================================================ FILE: src/examples/pytorch/torchserve/infer_bert.py ================================================ import json import concurrent.futures import requests with open('config.json') as fp: config = json.load(fp) max_length = config['max_length'] batch_size = config['batch_size'] name = f'bert-max_length{max_length}-batch_size{batch_size}' # dispatch requests in parallel url = f'http://localhost:8080/predictions/{name}' paraphrase = {'seq_0': "HuggingFace's headquarters are situated in Manhattan", 'seq_1': "The company HuggingFace is based in New York City"} not_paraphrase = {'seq_0': paraphrase['seq_0'], 'seq_1': 'This is total nonsense.'} with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor: def worker_thread(worker_index): # we'll send half the requests as not_paraphrase examples for sanity data = paraphrase if worker_index < batch_size//2 else not_paraphrase try: response = requests.post(url, data=data) # Check if the response status code indicates success if response.status_code == 200: print(worker_index, response.json()) else: # If the response is not successful, raise an exception with the status code and error message error_message = response.json().get('message', 'Unknown Error') raise Exception(f"Failed request with status code {response.status_code}: 
{error_message}") except Exception as e: # Catch all other exceptions that may be raised print(f"An unexpected error occurred: {e}") raise for worker_index in range(batch_size): executor.submit(worker_thread, worker_index) ================================================ FILE: src/examples/pytorch/torchserve/torchserve.config ================================================ # bind inference API to all network interfaces with SSL enabled inference_address=http://0.0.0.0:8080 default_workers_per_model=1 ================================================ FILE: src/examples/pytorch/torchserve/trace_bert_neuron.py ================================================ import torch import torch_neuron from transformers import AutoTokenizer, AutoModelForSequenceClassification # Build tokenizer and model tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc") model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False) # Setup some example inputs sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" max_length = 128 batch_size = 6 paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") example_inputs_paraphrase = ( torch.cat([paraphrase['input_ids']] * batch_size, 0), torch.cat([paraphrase['attention_mask']] * batch_size, 0), torch.cat([paraphrase['token_type_ids']] * batch_size, 0) ) # Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron model_neuron_batch = torch_neuron.trace(model, example_inputs_paraphrase) # Save the batched model model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size)) ================================================ FILE: src/examples/pytorch/torchserve/trace_bert_neuronx.py ================================================ import torch import torch_neuronx from transformers import AutoTokenizer, AutoModelForSequenceClassification # Build tokenizer and model tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc") model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False) # Setup some example inputs sequence_0 = "The company HuggingFace is based in New York City" sequence_1 = "HuggingFace's headquarters are situated in Manhattan" max_length = 128 batch_size = 6 paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt") example_inputs_paraphrase = ( torch.cat([paraphrase['input_ids']] * batch_size, 0), torch.cat([paraphrase['attention_mask']] * batch_size, 0), torch.cat([paraphrase['token_type_ids']] * batch_size, 0) ) # Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron model_neuron_batch = torch_neuronx.trace(model, example_inputs_paraphrase) # Save the batched model model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size)) ================================================ FILE: src/examples/pytorch/transformers-marianmt.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformers MarianMT Tutorial\n", "\n", "In this tutorial, you will deploy the [HuggingFace MarianMT](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html) model for text translation.\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge instance since you will 
be loading and compiling several large models.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page.\n", "\n", "To generate text, you will be using the beam search algorithm to incrementally generate token candidates until the full output text has been created. Unlike simple single-pass models, this algorithm divides the work into two distinct phases:\n", "\n", "- **Encoder**: Convert the input text into an encoded representation. (Executed once)\n", "- **Decoder**: Use the encoded representation of the input text and the current output tokens to incrementally generate the set of next best candidate tokens. (Executed many times)\n", "\n", "In this tutorial you will perform the following steps:\n", "\n", "- **Compile**: Compile both the Encoder and Decoder for Neuron using simplified interfaces for inference.\n", "- **Infer**: Run on CPU and Neuron and compare results.\n", "\n", "Finally, a completely unrolled decoder will be built which simplifies the implementation at the cost of performing fixed-length inferences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "\n", "This tutorial has the following dependencies:\n", "\n", "- `transformers==4.26.1`\n", "- `torch-neuron`\n", "- `sentencepiece`\n", "- `neuron-cc[tensorflow]`\n", "\n", "The following will install the required `transformers` version. Note that encoder/decoder API changes across different minor versions require that you are specific about the version used. Also note that the `torch-neuron` version is pinned due to `transformers` compatibility issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install sentencepiece transformers==4.26.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameters\n", "\n", "The parameters of a generative model can be tuned for different use-cases. In this example, you'll tailor the parameters to a single inference beam search for an on-demand inference use-case. See the [MarianConfig](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html#marianconfig) for parameter details.\n", "\n", "Rather than varying the encoder/decoder token sizes at runtime, you must define these parameters prior to compilation. The encoder/decoder token sizes are important tunable parameters as a large token sequence will offer greater sentence length flexibility but perform worse than a small token sequence.\n", "\n", "To maximize performance on Neuron, the `num_beams`, `max_encoder_length` and `max_decoder_length` should be made as small as possible for the use-case.\n", "\n", "For this tutorial you will use a model that translates sentences of up to 32 tokens from English to German." 
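, "\n", "For instance, with `max_encoder_length = 32`, every input sentence is padded or truncated to exactly 32 tokens before it reaches the traced encoder, so a short input pays the same encoder cost as a full-length one. A rough illustration (shapes only, using the tokenizer defined later in this tutorial):\n", "\n", "```python\n", "batch = tokenizer(\"I am a small frog.\", max_length=max_encoder_length,\n", " truncation=True, padding='max_length', return_tensors=\"pt\")\n", "assert batch['input_ids'].shape == (1, max_encoder_length)\n", "```"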
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "model_name = \"Helsinki-NLP/opus-mt-en-de\" # English -> German model\n", "num_texts = 1 # Number of input texts to decode\n", "num_beams = 4 # Number of beams per input text\n", "max_encoder_length = 32 # Maximum input token length\n", "max_decoder_length = 32 # Maximum output token length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU Model Inference\n", "\n", "Start by executing the model on CPU to test its execution.\n", "\n", "The following defines the inference function which will be used to compare the Neuron and CPU output. In this example you will display all beam search sequences that were generated. For a real on-demand use case, set the `num_beams` to `1` to return only the top result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def infer(model, tokenizer, text):\n", "\n", " # Truncate and pad the max length to ensure that the token size is compatible with fixed-sized encoder (Not necessary for pure CPU execution)\n", " batch = tokenizer(text, max_length=max_decoder_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " output = model.generate(**batch, max_length=max_decoder_length, num_beams=num_beams, num_return_sequences=num_beams)\n", " results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", " print('Texts:')\n", " for i, summary in enumerate(results):\n", " print(i + 1, summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that after loading the model, we also set the maximum length. This will later be used to limit the size of the compiled model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import MarianMTModel, MarianTokenizer\n", "\n", "model_cpu = MarianMTModel.from_pretrained(model_name)\n", "model_cpu.config.max_length = max_decoder_length\n", "model_cpu.eval()\n", "\n", "tokenizer = MarianTokenizer.from_pretrained(model_name)\n", "\n", "sample_text = \"I am a small frog.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infer(model_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Padded Model\n", "In order to perform inference on Neuron, the model must be changed in a way that it supports tracing and fixed-sized inputs. One way in which this is possible is to use a pad the model inputs to the maximum possible tensor sizes. The benefit of using a padded model is that it supports variable length text generation up to a specified length `max_decoder_length`. A consequence of padding is that it can negatively impact performance due to large data transfers.\n", "\n", "### PaddedEncoder & PaddedDecoder Modules\n", "Here you will define wrappers around the encoder and decoder portions of the generation model that are compatible with `torch.jit.trace` as well as fixed-sized inputs.\n", "\n", "The following are important features which are distinct from the default configuration:\n", "\n", "1. Disabled `return_dict`. When this is enabled, the network uses `dataclass` type outputs which are not compatible with `torch.jit.trace`.\n", "2. Disabled `use_cache`. When this option is enabled, the network expects a collection of cache tensors which grow upon each iteration. 
Since Neuron requires fixed sized inputs, this must be disabled.\n", "3. The `GenerationMixin:beam_search` implementation uses only the logits for the current iteration index from the original decoder layer output. Since inputs must be padded, performance can be improved by selecting only a subset of the hidden state prior to the final linear layer. For efficiency on Neuron, this reduction uses an elementwise-multiply to mask out the unused hidden values and then sums along an axis.\n", "4. Since a reduction step is inserted between the decoder output and the final logit calculation, the original `model` attribute is not used. Instead the `PaddedDecoder` class combines the decoder, reducer, and linear layers into a combined forward pass. In the original model there is a clear distinction between the decoder layer and the final linear layer. These layers are fused together to get one large fully optimized graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.nn import functional as F\n", "\n", "\n", "class PaddedEncoder(torch.nn.Module):\n", "\n", " def __init__(self, model):\n", " super().__init__()\n", " self.encoder = model.model.encoder\n", " self.main_input_name = 'input_ids'\n", " \n", " def forward(self, input_ids, attention_mask):\n", " return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)\n", "\n", "\n", "class PaddedDecoder(torch.nn.Module):\n", "\n", " def __init__(self, model):\n", " super().__init__()\n", " self.weight = model.model.shared.weight.clone().detach()\n", " self.bias = model.final_logits_bias.clone().detach()\n", " self.decoder = model.model.decoder\n", "\n", " def forward(self, input_ids, attention_mask, encoder_outputs, index):\n", "\n", " # Invoke the decoder\n", " hidden, = self.decoder(\n", " input_ids=input_ids,\n", " encoder_hidden_states=encoder_outputs,\n", " encoder_attention_mask=attention_mask,\n", " return_dict=False,\n", " use_cache=False,\n", " )\n", "\n", " _, n_length, _ = hidden.shape\n", "\n", " # Create selection mask\n", " mask = torch.arange(n_length, dtype=torch.float32) == index\n", " mask = mask.view(1, -1, 1)\n", "\n", " # Broadcast mask\n", " masked = torch.multiply(hidden, mask)\n", "\n", " # Reduce along 1st dimension\n", " hidden = torch.sum(masked, 1, keepdims=True)\n", "\n", " # Compute final linear layer for token probabilities\n", " logits = F.linear(\n", " hidden,\n", " self.weight,\n", " bias=self.bias\n", " )\n", " return logits\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PaddedGenerator - GenerationMixin Class\n", "\n", "\n", "On text generation tasks, HuggingFace Transformers defines a [GenerationMixin](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin) base class which provides standard methods and algorithms to generate text. For this tutorial, you will be using the beam search algorithm on encoder/decoder architectures.\n", "\n", "To be able to use these methods, you will be defining your own class derived from the GenerationMixin class to run a beam search. This will invoke the encoder and decoder layers in a way that is compatible with fixed sized inputs and traced modules. 
This means you must import the base class and the output objects ([Seq2SeqLMOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.Seq2SeqLMOutput), [BaseModelOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.BaseModelOutput)) used by the [beam_search](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.beam_search) algorithm.\n", "\n", "The `GenerationMixin:generate` method will use `GenerationMixin:beam_search` which requires that you define your own class implementation that invokes the `PaddedEncoder` and `PaddedDecoder` modules using padded inputs. The standard generator model implementation will not work by default because it is intended to infer with variable-sized (growing) input tensors. \n", "\n", "The `from_model` method is defined to create the `PaddedGenerator` from an existing pretrained generator class.\n", "\n", "To invoke the Encoder and Decoder traced modules in a way that is compatible with the `GenerationMixin:beam_search` implementation, the `get_encoder`, `__call__`, and `prepare_inputs_for_generation` methods are overridden.\n", "\n", "Lastly, the class defines methods for serialization so that the model can be easily saved and loaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from transformers import GenerationMixin, AutoConfig\n", "from transformers.modeling_outputs import Seq2SeqLMOutput, BaseModelOutput\n", "from transformers.modeling_utils import PreTrainedModel\n", "\n", "\n", "class PaddedGenerator(PreTrainedModel, GenerationMixin):\n", "\n", " @classmethod\n", " def from_model(cls, model):\n", " generator = cls(model.config)\n", " generator.encoder = PaddedEncoder(model)\n", " generator.decoder = PaddedDecoder(model)\n", " return generator\n", " \n", " def prepare_inputs_for_generation(\n", " self,\n", " input_ids,\n", " encoder_outputs=None,\n", " attention_mask=None,\n", " **kwargs,\n", " ):\n", " # Pad the inputs for Neuron\n", " current_length = input_ids.shape[1]\n", " pad_size = self.config.max_length - current_length\n", " return dict(\n", " input_ids=F.pad(input_ids, (0, pad_size)),\n", " attention_mask=attention_mask,\n", " encoder_outputs=encoder_outputs.last_hidden_state,\n", " current_length=torch.tensor(current_length - 1),\n", " )\n", "\n", " def get_encoder(self):\n", " def encode(input_ids, attention_mask, **kwargs): \n", " output, = self.encoder(input_ids, attention_mask)\n", " return BaseModelOutput(\n", " last_hidden_state=output,\n", " )\n", " return encode\n", "\n", " def forward(self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs):\n", " logits = self.decoder(input_ids, attention_mask, encoder_outputs, current_length)\n", " return Seq2SeqLMOutput(logits=logits)\n", "\n", " @property\n", " def device(self): # Attribute required by beam search\n", " return torch.device('cpu')\n", " \n", " def save_pretrained(self, directory):\n", " if os.path.isfile(directory):\n", " print(f\"Provided path ({directory}) should be a directory, not a file\")\n", " return\n", " os.makedirs(directory, exist_ok=True)\n", " torch.jit.save(self.encoder, os.path.join(directory, 'encoder.pt'))\n", " torch.jit.save(self.decoder, os.path.join(directory, 'decoder.pt'))\n", " self.config.save_pretrained(directory)\n", "\n", " @classmethod\n", " def from_pretrained(cls, directory):\n", 
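" # Reload the artifacts written by save_pretrained(): the model config plus the two\n", " # serialized TorchScript programs, then restore the attribute that beam search expects.\n",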
" config = AutoConfig.from_pretrained(directory)\n", " obj = cls(config)\n", " obj.encoder = torch.jit.load(os.path.join(directory, 'encoder.pt'))\n", " obj.decoder = torch.jit.load(os.path.join(directory, 'decoder.pt'))\n", " setattr(obj.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", " return obj\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padded CPU Inference\n", "To start, it is important to ensure that the transformations we have made to the model were successful. Using the classes defined above we can test that the padded model execution on CPU is identical to the original output also running on CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_cpu = PaddedGenerator.from_model(model_cpu)\n", "infer(padded_model_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padded Neuron Tracing & Inference\n", "\n", "Now that the padded version of model is confirmed to produce the same outputs as the non-padded version, the model can be compiled for Neuron." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "\n", "\n", "def trace(model, num_texts, num_beams, max_decoder_length, max_encoder_length):\n", " \"\"\"\n", " Traces the encoder and decoder modules for use on Neuron.\n", "\n", " This function fixes the network to the given sizes. Once the model has been\n", " compiled to a given size, the inputs to these networks must always be of\n", " fixed size.\n", "\n", " Args:\n", " model (PaddedGenerator): The padded generator to compile for Neuron\n", " num_texts (int): The number of input texts to translate at once\n", " num_beams (int): The number of beams to compute per text\n", " max_decoder_length (int): The maximum number of tokens to be generated\n", " max_encoder_length (int): The maximum number of input tokens that will be encoded\n", " \"\"\"\n", "\n", " # Trace the encoder\n", " inputs = (\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " )\n", " encoder = torch_neuron.trace(model.encoder, inputs)\n", "\n", " # Trace the decoder (with expanded inputs)\n", " batch_size = num_texts * num_beams\n", " inputs = (\n", " torch.ones((batch_size, max_decoder_length), dtype=torch.long),\n", " torch.ones((batch_size, max_encoder_length), dtype=torch.long),\n", " torch.ones((batch_size, max_encoder_length, model.config.d_model), dtype=torch.float),\n", " torch.tensor(0),\n", " )\n", " decoder = torch_neuron.trace(model.decoder, inputs)\n", " \n", " traced = PaddedGenerator(model.config)\n", " traced.encoder = encoder\n", " traced.decoder = decoder\n", " setattr(encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", " return traced" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_neuron = trace(padded_model_cpu, num_texts, num_beams, max_decoder_length, max_encoder_length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the Neuron execution to the original CPU implementation, you will see the exact same generated text.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU execution for comparison\n", "infer(padded_model_neuron, tokenizer, sample_text)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Padded Neuron Serialization\n", "Finally, we can test that we can serialize and reload the model so that it can be used later in its precompiled format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_neuron.save_pretrained('NeuronPaddedMarianMT')\n", "padded_model_loaded = PaddedGenerator.from_pretrained('NeuronPaddedMarianMT')\n", "infer(padded_model_loaded, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Greedy Unrolled Model\n", "An unrolled version of the model can achieve better performance in some cases since all operations will be executed on the Neuron hardware without returning to CPU. The consequence of this type of model is that since the generation loop execution never returns to CPU, the entire sequence up to `max_decoder_length` is performed in a single forward pass.\n", "\n", "The following module performs greedy text generation. Unlike the original beam search text generation, this implementation always selects the most probable token and does not generate multiple result texts.\n", "\n", "### GreedyUnrolledGenerator Module" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class GreedyUnrolledGenerator(torch.nn.Module):\n", " \n", " def __init__(self, model):\n", " super().__init__()\n", " self.config = model.config\n", " self.model = model\n", " \n", " def forward(self, input_ids, attention_mask):\n", " \n", " # Generate the encoder state for the input tokens. This is only done once and the state is reused.\n", " encoder_outputs, = self.model.model.encoder(input_ids, attention_mask=attention_mask, return_dict=False)\n", " \n", " # Set the intial state for the decode loop. This will grow per decoder iteration\n", " tokens = torch.full((input_ids.size(0), 2), self.config.decoder_start_token_id)\n", " \n", " # Iteratively invoke the decoder on incrementally generated `tokens` to generate a `next_token`.\n", " # Note that unlike the GeneratorMixin.generate function, there is no early-exit if the stop token \n", " # has been reached. This will always run a fixed number of iterations.\n", " for i in range(self.config.max_length):\n", " \n", " hidden, = self.model.model.decoder(\n", " input_ids=tokens,\n", " encoder_hidden_states=encoder_outputs,\n", " encoder_attention_mask=attention_mask,\n", " return_dict=False,\n", " use_cache=False,\n", " ) # size: [batch, current_length, vocab_size]\n", " \n", " logits = F.linear(\n", " hidden[:, -1, :],\n", " self.model.model.shared.weight,\n", " bias=self.model.final_logits_bias\n", " )\n", " next_tokens = torch.argmax(logits, dim=1, keepdims=True)\n", " tokens = torch.cat([tokens, next_tokens], dim=1)\n", " \n", " return tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy CPU Inference\n", "The inference code must be updated since the `generate` method is no longer used. This is because the entire generative inference loop occurs within the `GreedyUnrolledGenerator.forward` method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def infer_greedy(model, tokenizer, text):\n", " batch = tokenizer(text, max_length=max_decoder_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " inputs = batch['input_ids'], batch['attention_mask']\n", " tokens = greedy_cpu(*inputs)\n", " print('Texts:')\n", " for i, t in enumerate(tokens):\n", " result = tokenizer.decode(t, skip_special_tokens=True)\n", " print(i + 1, result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like in previous section of this tutorial, first the greedy model is executed on CPU to validate that the correct results were produced. In this example, the generated text matches the first result of the original beam search." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_cpu.config.max_length = 8 # This controls the number of decoder loops. Reduced to improve compilation speed.\n", "greedy_cpu = GreedyUnrolledGenerator(model_cpu)\n", "infer_greedy(greedy_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy Neuron Tracing & Inference\n", "Similarly the tracing is simplified since the now the `GreedyUnrolledGenerator.forward` can be compiled as a single unit. \n", "\n", "For compilation efficiency, two changes will be made compared to normal compilaition:\n", "- `torch.jit.freeze` is used because it can *sometimes* speed up compilation by in the case where a module is re-used multiple times. In this case, it is more efficient because the `self.model.model.decoder` is used in a loop. \n", "- The `torch_neuron.trace` option `fallback` is set to `False`. This forces all operations to execute on Neuron. Most of the time this is not recommended or efficient. In this case, it is more efficient because it means a single subgraph is produced rather than many. Usually one subgraph would be produced per decoder iteration since `aten::embedding` is executed in a loop. The `aten::embedding` operation is otherwise exected on CPU by default since this is usually more efficient than executing on Neuron.\n", "\n", "You may notice that compilation will take significantly longer with the unrolled model since the model inserts new operations into the compute graph for every single decoder iteration. This creates a much larger model graph even though the weights are re-used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example = (\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", ")\n", "greedy_cpu.eval()\n", "greedy_trace = torch.jit.trace(greedy_cpu, example)\n", "greedy_frozen = torch.jit.freeze(greedy_trace)\n", "greedy_neuron = torch_neuron.trace(greedy_frozen, example, fallback=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infer_greedy(greedy_neuron, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy Neuron Serialization\n", "Unlike the previous version of the model that used the `GenerationMixin` base class. This greedy version of the model can be serialized using the regular `torch.jit.save` and `torch.jit.load` utilities since it is a pure torchscript module." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "torch.jit.save(greedy_neuron, 'greedy_neuron.pt')\n", "loaded_greedy_neuron = torch.jit.load('greedy_neuron.pt')\n", "infer_greedy(loaded_greedy_neuron, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix\n", "### BART (Mask Filling Task)\n", "\n", "These `PaddedGenerator` class can be applied to the BART model for the task of filling in mask tokens.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from transformers import BartForConditionalGeneration, BartTokenizer\n", "bart_name = \"facebook/bart-large\"\n", "bart_model = BartForConditionalGeneration.from_pretrained(bart_name)\n", "bart_model.config.max_length = max_decoder_length\n", "bart_tokenizer = BartTokenizer.from_pretrained(bart_name)\n", "bart_text = \"UN Chief Says There Is No in Syria\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU Execution\n", "infer(bart_model, bart_tokenizer, bart_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Neuron Execution\n", "paddded_bart = PaddedGenerator.from_model(bart_model)\n", "bart_neuron = trace(paddded_bart, num_texts, num_beams, max_decoder_length, max_encoder_length)\n", "infer(bart_neuron, bart_tokenizer, bart_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pegasus (Summarization Task)\n", "\n", "These `PaddedGenerator` class can be applied to the Pegasus model for summarization.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from transformers import PegasusForConditionalGeneration, PegasusTokenizer\n", "pegasus_name = 'google/pegasus-xsum'\n", "pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_name)\n", "pegasus_model.config.max_length = max_decoder_length\n", "pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_name)\n", "pegasus_text = \"PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
The aim is to reduce the risk of wildfires.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU Execution\n", "infer(pegasus_model, pegasus_tokenizer, pegasus_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Neuron Execution\n", "paddded_pegasus = PaddedGenerator.from_model(pegasus_model)\n", "pegasus_neuron = trace(paddded_pegasus, num_texts, num_beams, max_decoder_length, max_encoder_length)\n", "infer(pegasus_neuron, pegasus_tokenizer, pegasus_text)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/pytorch/yolo_v4.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate YOLO v4 on Inferentia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This tutorial walks through compiling and evaluating YOLO v4 model implemented in PyTorch on Inferentia. \n", "\n", "The tutorial has five main sections:\n", "\n", "1. Define YOLO v4 model in PyTorch\n", "2. Download the COCO 2017 evaluation dataset and define the data loader function\n", "3. Build, Compile, and Save Neuron-Optimized YOLO v4 TorchScript\n", "4. Evaluate Accuracy on the COCO 2017 Dataset\n", "5. Benchmark COCO Dataset Performance of the Neuron-Optimized TorchScript\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `torchvision`\n", "- `pillow`\n", "- `pycocotools`\n", "- `neuron-cc[tensorflow]`\n", "\n", "Many of these packages will be installed by default when configuring your environment using the Neuron PyTorch setup guide. The additional dependencies must be installed here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade pillow pycocotools " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Define YOLO v4 model in PyTorch \n", "The following PyTorch model definition is from https://github.com/Tianxiaomo/pytorch-YOLOv4/." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import torch\n", "import torch.neuron\n", "from torch import nn\n", "import torch.nn.functional as F\n", "import os\n", "import warnings\n", "\n", "# Setting up NeuronCore groups for inf1.6xlarge with 16 cores\n", "n_cores = 16 # This value should be 4 on inf1.xlarge and inf1.2xlarge\n", "os.environ['NEURON_RT_NUM_CORES'] = str(n_cores)\n", "\n", "\n", "class Mish(torch.nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", "\n", " def forward(self, x):\n", " x = x * (torch.tanh(torch.nn.functional.softplus(x)))\n", " return x\n", "\n", "\n", "class Upsample(nn.Module):\n", " def __init__(self):\n", " super(Upsample, self).__init__()\n", "\n", " def forward(self, x, target_size, inference=False):\n", " assert (x.data.dim() == 4)\n", "\n", " if inference:\n", "\n", " return x.view(x.size(0), x.size(1), x.size(2), 1, x.size(3), 1).\\\n", " expand(x.size(0), x.size(1), x.size(2), target_size[2] // x.size(2), x.size(3), target_size[3] // x.size(3)).\\\n", " contiguous().view(x.size(0), x.size(1), target_size[2], target_size[3])\n", " else:\n", " return F.interpolate(x, size=(target_size[2], target_size[3]), mode='nearest')\n", "\n", "\n", "class Conv_Bn_Activation(nn.Module):\n", " def __init__(self, in_channels, out_channels, kernel_size, stride, activation, bn=True, bias=False):\n", " super().__init__()\n", " pad = (kernel_size - 1) // 2\n", "\n", " self.conv = nn.ModuleList()\n", " if bias:\n", " self.conv.append(nn.Conv2d(in_channels, out_channels, kernel_size, stride, pad))\n", " else:\n", " self.conv.append(nn.Conv2d(in_channels, out_channels, kernel_size, stride, pad, bias=False))\n", " if bn:\n", " self.conv.append(nn.BatchNorm2d(out_channels))\n", " if activation == \"mish\":\n", " self.conv.append(Mish())\n", " elif activation == \"relu\":\n", " self.conv.append(nn.ReLU(inplace=True))\n", " elif activation == \"leaky\":\n", " self.conv.append(nn.LeakyReLU(0.1, inplace=True))\n", " elif activation == \"linear\":\n", " pass\n", " else:\n", " print(\"activate error !!! 
{} {} {}\".format(sys._getframe().f_code.co_filename,\n", " sys._getframe().f_code.co_name, sys._getframe().f_lineno))\n", "\n", " def forward(self, x):\n", " for l in self.conv:\n", " x = l(x)\n", " return x\n", "\n", "\n", "class ResBlock(nn.Module):\n", " \"\"\"\n", " Sequential residual blocks each of which consists of \\\n", " two convolution layers.\n", " Args:\n", " ch (int): number of input and output channels.\n", " nblocks (int): number of residual blocks.\n", " shortcut (bool): if True, residual tensor addition is enabled.\n", " \"\"\"\n", "\n", " def __init__(self, ch, nblocks=1, shortcut=True):\n", " super().__init__()\n", " self.shortcut = shortcut\n", " self.module_list = nn.ModuleList()\n", " for i in range(nblocks):\n", " resblock_one = nn.ModuleList()\n", " resblock_one.append(Conv_Bn_Activation(ch, ch, 1, 1, 'mish'))\n", " resblock_one.append(Conv_Bn_Activation(ch, ch, 3, 1, 'mish'))\n", " self.module_list.append(resblock_one)\n", "\n", " def forward(self, x):\n", " for module in self.module_list:\n", " h = x\n", " for res in module:\n", " h = res(h)\n", " x = x + h if self.shortcut else h\n", " return x\n", "\n", "\n", "class DownSample1(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", " self.conv1 = Conv_Bn_Activation(3, 32, 3, 1, 'mish')\n", "\n", " self.conv2 = Conv_Bn_Activation(32, 64, 3, 2, 'mish')\n", " self.conv3 = Conv_Bn_Activation(64, 64, 1, 1, 'mish')\n", " # [route]\n", " # layers = -2\n", " self.conv4 = Conv_Bn_Activation(64, 64, 1, 1, 'mish')\n", "\n", " self.conv5 = Conv_Bn_Activation(64, 32, 1, 1, 'mish')\n", " self.conv6 = Conv_Bn_Activation(32, 64, 3, 1, 'mish')\n", " # [shortcut]\n", " # from=-3\n", " # activation = linear\n", "\n", " self.conv7 = Conv_Bn_Activation(64, 64, 1, 1, 'mish')\n", " # [route]\n", " # layers = -1, -7\n", " self.conv8 = Conv_Bn_Activation(128, 64, 1, 1, 'mish')\n", "\n", " def forward(self, input):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x2)\n", " # route -2\n", " x4 = self.conv4(x2)\n", " x5 = self.conv5(x4)\n", " x6 = self.conv6(x5)\n", " # shortcut -3\n", " x6 = x6 + x4\n", "\n", " x7 = self.conv7(x6)\n", " # [route]\n", " # layers = -1, -7\n", " x7 = torch.cat([x7, x3], dim=1)\n", " x8 = self.conv8(x7)\n", " return x8\n", "\n", "\n", "class DownSample2(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", " self.conv1 = Conv_Bn_Activation(64, 128, 3, 2, 'mish')\n", " self.conv2 = Conv_Bn_Activation(128, 64, 1, 1, 'mish')\n", " # r -2\n", " self.conv3 = Conv_Bn_Activation(128, 64, 1, 1, 'mish')\n", "\n", " self.resblock = ResBlock(ch=64, nblocks=2)\n", "\n", " # s -3\n", " self.conv4 = Conv_Bn_Activation(64, 64, 1, 1, 'mish')\n", " # r -1 -10\n", " self.conv5 = Conv_Bn_Activation(128, 128, 1, 1, 'mish')\n", "\n", " def forward(self, input):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x1)\n", "\n", " r = self.resblock(x3)\n", " x4 = self.conv4(r)\n", "\n", " x4 = torch.cat([x4, x2], dim=1)\n", " x5 = self.conv5(x4)\n", " return x5\n", "\n", "\n", "class DownSample3(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", " self.conv1 = Conv_Bn_Activation(128, 256, 3, 2, 'mish')\n", " self.conv2 = Conv_Bn_Activation(256, 128, 1, 1, 'mish')\n", " self.conv3 = Conv_Bn_Activation(256, 128, 1, 1, 'mish')\n", "\n", " self.resblock = ResBlock(ch=128, nblocks=8)\n", " self.conv4 = Conv_Bn_Activation(128, 128, 1, 1, 'mish')\n", " self.conv5 = Conv_Bn_Activation(256, 256, 1, 1, 'mish')\n", "\n", " def forward(self, 
input):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x1)\n", "\n", " r = self.resblock(x3)\n", " x4 = self.conv4(r)\n", "\n", " x4 = torch.cat([x4, x2], dim=1)\n", " x5 = self.conv5(x4)\n", " return x5\n", "\n", "\n", "class DownSample4(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", " self.conv1 = Conv_Bn_Activation(256, 512, 3, 2, 'mish')\n", " self.conv2 = Conv_Bn_Activation(512, 256, 1, 1, 'mish')\n", " self.conv3 = Conv_Bn_Activation(512, 256, 1, 1, 'mish')\n", "\n", " self.resblock = ResBlock(ch=256, nblocks=8)\n", " self.conv4 = Conv_Bn_Activation(256, 256, 1, 1, 'mish')\n", " self.conv5 = Conv_Bn_Activation(512, 512, 1, 1, 'mish')\n", "\n", " def forward(self, input):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x1)\n", "\n", " r = self.resblock(x3)\n", " x4 = self.conv4(r)\n", "\n", " x4 = torch.cat([x4, x2], dim=1)\n", " x5 = self.conv5(x4)\n", " return x5\n", "\n", "\n", "class DownSample5(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", " self.conv1 = Conv_Bn_Activation(512, 1024, 3, 2, 'mish')\n", " self.conv2 = Conv_Bn_Activation(1024, 512, 1, 1, 'mish')\n", " self.conv3 = Conv_Bn_Activation(1024, 512, 1, 1, 'mish')\n", "\n", " self.resblock = ResBlock(ch=512, nblocks=4)\n", " self.conv4 = Conv_Bn_Activation(512, 512, 1, 1, 'mish')\n", " self.conv5 = Conv_Bn_Activation(1024, 1024, 1, 1, 'mish')\n", "\n", " def forward(self, input):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x1)\n", "\n", " r = self.resblock(x3)\n", " x4 = self.conv4(r)\n", "\n", " x4 = torch.cat([x4, x2], dim=1)\n", " x5 = self.conv5(x4)\n", " return x5\n", "\n", "\n", "class Neck(nn.Module):\n", " def __init__(self, inference=False):\n", " super().__init__()\n", " self.inference = inference\n", "\n", " self.conv1 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " self.conv2 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')\n", " self.conv3 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " # SPP\n", " self.maxpool1 = nn.MaxPool2d(kernel_size=5, stride=1, padding=5 // 2)\n", " self.maxpool2 = nn.MaxPool2d(kernel_size=9, stride=1, padding=9 // 2)\n", " self.maxpool3 = nn.MaxPool2d(kernel_size=13, stride=1, padding=13 // 2)\n", "\n", " # R -1 -3 -5 -6\n", " # SPP\n", " self.conv4 = Conv_Bn_Activation(2048, 512, 1, 1, 'leaky')\n", " self.conv5 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')\n", " self.conv6 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " self.conv7 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " # UP\n", " self.upsample1 = Upsample()\n", " # R 85\n", " self.conv8 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " # R -1 -3\n", " self.conv9 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv10 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')\n", " self.conv11 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv12 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')\n", " self.conv13 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv14 = Conv_Bn_Activation(256, 128, 1, 1, 'leaky')\n", " # UP\n", " self.upsample2 = Upsample()\n", " # R 54\n", " self.conv15 = Conv_Bn_Activation(256, 128, 1, 1, 'leaky')\n", " # R -1 -3\n", " self.conv16 = Conv_Bn_Activation(256, 128, 1, 1, 'leaky')\n", " self.conv17 = Conv_Bn_Activation(128, 256, 3, 1, 'leaky')\n", " self.conv18 = Conv_Bn_Activation(256, 128, 1, 1, 'leaky')\n", " self.conv19 = Conv_Bn_Activation(128, 256, 3, 1, 'leaky')\n", " self.conv20 = Conv_Bn_Activation(256, 128, 1, 1, 'leaky')\n", 
"\n", " def forward(self, input, downsample4, downsample3, inference=False):\n", " x1 = self.conv1(input)\n", " x2 = self.conv2(x1)\n", " x3 = self.conv3(x2)\n", " # SPP\n", " m1 = self.maxpool1(x3)\n", " m2 = self.maxpool2(x3)\n", " m3 = self.maxpool3(x3)\n", " spp = torch.cat([m3, m2, m1, x3], dim=1)\n", " # SPP end\n", " x4 = self.conv4(spp)\n", " x5 = self.conv5(x4)\n", " x6 = self.conv6(x5)\n", " x7 = self.conv7(x6)\n", " # UP\n", " up = self.upsample1(x7, downsample4.size(), self.inference)\n", " # R 85\n", " x8 = self.conv8(downsample4)\n", " # R -1 -3\n", " x8 = torch.cat([x8, up], dim=1)\n", "\n", " x9 = self.conv9(x8)\n", " x10 = self.conv10(x9)\n", " x11 = self.conv11(x10)\n", " x12 = self.conv12(x11)\n", " x13 = self.conv13(x12)\n", " x14 = self.conv14(x13)\n", "\n", " # UP\n", " up = self.upsample2(x14, downsample3.size(), self.inference)\n", " # R 54\n", " x15 = self.conv15(downsample3)\n", " # R -1 -3\n", " x15 = torch.cat([x15, up], dim=1)\n", "\n", " x16 = self.conv16(x15)\n", " x17 = self.conv17(x16)\n", " x18 = self.conv18(x17)\n", " x19 = self.conv19(x18)\n", " x20 = self.conv20(x19)\n", " return x20, x13, x6\n", "\n", "\n", "class Yolov4Head(nn.Module):\n", " def __init__(self, output_ch, n_classes, inference=False):\n", " super().__init__()\n", " self.inference = inference\n", "\n", " self.conv1 = Conv_Bn_Activation(128, 256, 3, 1, 'leaky')\n", " self.conv2 = Conv_Bn_Activation(256, output_ch, 1, 1, 'linear', bn=False, bias=True)\n", "\n", " self.yolo1 = YoloLayer(\n", " anchor_mask=[0, 1, 2], num_classes=n_classes,\n", " anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],\n", " num_anchors=9, stride=8)\n", "\n", " # R -4\n", " self.conv3 = Conv_Bn_Activation(128, 256, 3, 2, 'leaky')\n", "\n", " # R -1 -16\n", " self.conv4 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv5 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')\n", " self.conv6 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv7 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')\n", " self.conv8 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')\n", " self.conv9 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')\n", " self.conv10 = Conv_Bn_Activation(512, output_ch, 1, 1, 'linear', bn=False, bias=True)\n", " \n", " self.yolo2 = YoloLayer(\n", " anchor_mask=[3, 4, 5], num_classes=n_classes,\n", " anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],\n", " num_anchors=9, stride=16)\n", "\n", " # R -4\n", " self.conv11 = Conv_Bn_Activation(256, 512, 3, 2, 'leaky')\n", "\n", " # R -1 -37\n", " self.conv12 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " self.conv13 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')\n", " self.conv14 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " self.conv15 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')\n", " self.conv16 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')\n", " self.conv17 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')\n", " self.conv18 = Conv_Bn_Activation(1024, output_ch, 1, 1, 'linear', bn=False, bias=True)\n", " \n", " self.yolo3 = YoloLayer(\n", " anchor_mask=[6, 7, 8], num_classes=n_classes,\n", " anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],\n", " num_anchors=9, stride=32)\n", "\n", " def forward(self, input1, input2, input3):\n", " x1 = self.conv1(input1)\n", " x2 = self.conv2(x1)\n", "\n", " x3 = self.conv3(input1)\n", " # R -1 -16\n", " x3 = torch.cat([x3, input2], dim=1)\n", " x4 = self.conv4(x3)\n", " x5 = self.conv5(x4)\n", " x6 = 
self.conv6(x5)\n", " x7 = self.conv7(x6)\n", " x8 = self.conv8(x7)\n", " x9 = self.conv9(x8)\n", " x10 = self.conv10(x9)\n", "\n", " # R -4\n", " x11 = self.conv11(x8)\n", " # R -1 -37\n", " x11 = torch.cat([x11, input3], dim=1)\n", "\n", " x12 = self.conv12(x11)\n", " x13 = self.conv13(x12)\n", " x14 = self.conv14(x13)\n", " x15 = self.conv15(x14)\n", " x16 = self.conv16(x15)\n", " x17 = self.conv17(x16)\n", " x18 = self.conv18(x17)\n", " \n", " if self.inference:\n", " y1 = self.yolo1(x2)\n", " y2 = self.yolo2(x10)\n", " y3 = self.yolo3(x18)\n", "\n", " return get_region_boxes([y1, y2, y3])\n", " \n", " else:\n", " return [x2, x10, x18]\n", "\n", "\n", "class Yolov4(nn.Module):\n", " def __init__(self, yolov4conv137weight=None, n_classes=80, inference=False):\n", " super().__init__()\n", "\n", " output_ch = (4 + 1 + n_classes) * 3\n", "\n", " # backbone\n", " self.down1 = DownSample1()\n", " self.down2 = DownSample2()\n", " self.down3 = DownSample3()\n", " self.down4 = DownSample4()\n", " self.down5 = DownSample5()\n", " # neck\n", " self.neek = Neck(inference)\n", " # yolov4conv137\n", " if yolov4conv137weight:\n", " _model = nn.Sequential(self.down1, self.down2, self.down3, self.down4, self.down5, self.neek)\n", " pretrained_dict = torch.load(yolov4conv137weight)\n", "\n", " model_dict = _model.state_dict()\n", " # 1. filter out unnecessary keys\n", " pretrained_dict = {k1: v for (k, v), k1 in zip(pretrained_dict.items(), model_dict)}\n", " # 2. overwrite entries in the existing state dict\n", " model_dict.update(pretrained_dict)\n", " _model.load_state_dict(model_dict)\n", " \n", " # head\n", " self.head = Yolov4Head(output_ch, n_classes, inference)\n", "\n", "\n", " def forward(self, input):\n", " d1 = self.down1(input)\n", " d2 = self.down2(d1)\n", " d3 = self.down3(d2)\n", " d4 = self.down4(d3)\n", " d5 = self.down5(d4)\n", "\n", " x20, x13, x6 = self.neek(d5, d4, d3)\n", "\n", " output = self.head(x20, x13, x6)\n", " return output\n", "\n", "\n", "def yolo_forward_dynamic(output, conf_thresh, num_classes, anchors, num_anchors, scale_x_y, only_objectness=1,\n", " validation=False):\n", " # Output would be invalid if it does not satisfy this assert\n", " # assert (output.size(1) == (5 + num_classes) * num_anchors)\n", "\n", " # print(output.size())\n", "\n", " # Slice the second dimension (channel) of output into:\n", " # [ 2, 2, 1, num_classes, 2, 2, 1, num_classes, 2, 2, 1, num_classes ]\n", " # And then into\n", " # bxy = [ 6 ] bwh = [ 6 ] det_conf = [ 3 ] cls_conf = [ num_classes * 3 ]\n", " # batch = output.size(0)\n", " # H = output.size(2)\n", " # W = output.size(3)\n", "\n", " bxy_list = []\n", " bwh_list = []\n", " det_confs_list = []\n", " cls_confs_list = []\n", "\n", " for i in range(num_anchors):\n", " begin = i * (5 + num_classes)\n", " end = (i + 1) * (5 + num_classes)\n", " \n", " bxy_list.append(output[:, begin : begin + 2])\n", " bwh_list.append(output[:, begin + 2 : begin + 4])\n", " det_confs_list.append(output[:, begin + 4 : begin + 5])\n", " cls_confs_list.append(output[:, begin + 5 : end])\n", "\n", " # Shape: [batch, num_anchors * 2, H, W]\n", " bxy = torch.cat(bxy_list, dim=1)\n", " # Shape: [batch, num_anchors * 2, H, W]\n", " bwh = torch.cat(bwh_list, dim=1)\n", "\n", " # Shape: [batch, num_anchors, H, W]\n", " det_confs = torch.cat(det_confs_list, dim=1)\n", " # Shape: [batch, num_anchors * H * W]\n", " det_confs = det_confs.view(output.size(0), num_anchors * output.size(2) * output.size(3))\n", "\n", " # Shape: [batch, num_anchors * num_classes, H, 
W]\n", " cls_confs = torch.cat(cls_confs_list, dim=1)\n", " # Shape: [batch, num_anchors, num_classes, H * W]\n", " cls_confs = cls_confs.view(output.size(0), num_anchors, num_classes, output.size(2) * output.size(3))\n", " # Shape: [batch, num_anchors, num_classes, H * W] --> [batch, num_anchors * H * W, num_classes] \n", " cls_confs = cls_confs.permute(0, 1, 3, 2).reshape(output.size(0), num_anchors * output.size(2) * output.size(3), num_classes)\n", "\n", " # Apply sigmoid(), exp() and softmax() to slices\n", " #\n", " bxy = torch.sigmoid(bxy) * scale_x_y - 0.5 * (scale_x_y - 1)\n", " bwh = torch.exp(bwh)\n", " det_confs = torch.sigmoid(det_confs)\n", " cls_confs = torch.sigmoid(cls_confs)\n", "\n", " # Prepare C-x, C-y, P-w, P-h (None of them are torch related)\n", " grid_x = np.expand_dims(np.expand_dims(np.expand_dims(np.linspace(0, output.size(3) - 1, output.size(3)), axis=0).repeat(output.size(2), 0), axis=0), axis=0)\n", " grid_y = np.expand_dims(np.expand_dims(np.expand_dims(np.linspace(0, output.size(2) - 1, output.size(2)), axis=1).repeat(output.size(3), 1), axis=0), axis=0)\n", " # grid_x = torch.linspace(0, W - 1, W).reshape(1, 1, 1, W).repeat(1, 1, H, 1)\n", " # grid_y = torch.linspace(0, H - 1, H).reshape(1, 1, H, 1).repeat(1, 1, 1, W)\n", "\n", " anchor_w = []\n", " anchor_h = []\n", " for i in range(num_anchors):\n", " anchor_w.append(anchors[i * 2])\n", " anchor_h.append(anchors[i * 2 + 1])\n", "\n", " device = None\n", " cuda_check = output.is_cuda\n", " if cuda_check:\n", " device = output.get_device()\n", "\n", " bx_list = []\n", " by_list = []\n", " bw_list = []\n", " bh_list = []\n", "\n", " # Apply C-x, C-y, P-w, P-h\n", " for i in range(num_anchors):\n", " ii = i * 2\n", " # Shape: [batch, 1, H, W]\n", " bx = bxy[:, ii : ii + 1] + torch.tensor(grid_x, device=device, dtype=torch.float32) # grid_x.to(device=device, dtype=torch.float32)\n", " # Shape: [batch, 1, H, W]\n", " by = bxy[:, ii + 1 : ii + 2] + torch.tensor(grid_y, device=device, dtype=torch.float32) # grid_y.to(device=device, dtype=torch.float32)\n", " # Shape: [batch, 1, H, W]\n", " bw = bwh[:, ii : ii + 1] * anchor_w[i]\n", " # Shape: [batch, 1, H, W]\n", " bh = bwh[:, ii + 1 : ii + 2] * anchor_h[i]\n", "\n", " bx_list.append(bx)\n", " by_list.append(by)\n", " bw_list.append(bw)\n", " bh_list.append(bh)\n", "\n", "\n", " ########################################\n", " # Figure out bboxes from slices #\n", " ########################################\n", " \n", " # Shape: [batch, num_anchors, H, W]\n", " bx = torch.cat(bx_list, dim=1)\n", " # Shape: [batch, num_anchors, H, W]\n", " by = torch.cat(by_list, dim=1)\n", " # Shape: [batch, num_anchors, H, W]\n", " bw = torch.cat(bw_list, dim=1)\n", " # Shape: [batch, num_anchors, H, W]\n", " bh = torch.cat(bh_list, dim=1)\n", "\n", " # Shape: [batch, 2 * num_anchors, H, W]\n", " bx_bw = torch.cat((bx, bw), dim=1)\n", " # Shape: [batch, 2 * num_anchors, H, W]\n", " by_bh = torch.cat((by, bh), dim=1)\n", "\n", " # normalize coordinates to [0, 1]\n", " bx_bw /= output.size(3)\n", " by_bh /= output.size(2)\n", "\n", " # Shape: [batch, num_anchors * H * W, 1]\n", " bx = bx_bw[:, :num_anchors].view(output.size(0), num_anchors * output.size(2) * output.size(3), 1)\n", " by = by_bh[:, :num_anchors].view(output.size(0), num_anchors * output.size(2) * output.size(3), 1)\n", " bw = bx_bw[:, num_anchors:].view(output.size(0), num_anchors * output.size(2) * output.size(3), 1)\n", " bh = by_bh[:, num_anchors:].view(output.size(0), num_anchors * output.size(2) * output.size(3), 
1)\n", "\n", " bx1 = bx - bw * 0.5\n", " by1 = by - bh * 0.5\n", " bx2 = bx1 + bw\n", " by2 = by1 + bh\n", "\n", " # Shape: [batch, num_anchors * h * w, 4] -> [batch, num_anchors * h * w, 1, 4]\n", " boxes = torch.cat((bx1, by1, bx2, by2), dim=2).view(output.size(0), num_anchors * output.size(2) * output.size(3), 1, 4)\n", " # boxes = boxes.repeat(1, 1, num_classes, 1)\n", "\n", " # boxes: [batch, num_anchors * H * W, 1, 4]\n", " # cls_confs: [batch, num_anchors * H * W, num_classes]\n", " # det_confs: [batch, num_anchors * H * W]\n", "\n", " det_confs = det_confs.view(output.size(0), num_anchors * output.size(2) * output.size(3), 1)\n", " confs = cls_confs * det_confs\n", "\n", " # boxes: [batch, num_anchors * H * W, 1, 4]\n", " # confs: [batch, num_anchors * H * W, num_classes]\n", "\n", " return boxes, confs\n", "\n", "class YoloLayer(nn.Module):\n", " \"\"\"\n", " Yolo layer\n", " model_out: while inference,is post-processing inside or outside the model\n", " true:outside\n", " \"\"\"\n", " def __init__(self, anchor_mask=[], num_classes=0, anchors=[], num_anchors=1, stride=32, model_out=False):\n", " super(YoloLayer, self).__init__()\n", " self.anchor_mask = anchor_mask\n", " self.num_classes = num_classes\n", " self.anchors = anchors\n", " self.num_anchors = num_anchors\n", " self.anchor_step = len(anchors) // num_anchors\n", " self.coord_scale = 1\n", " self.noobject_scale = 1\n", " self.object_scale = 5\n", " self.class_scale = 1\n", " self.thresh = 0.6\n", " self.stride = stride\n", " self.seen = 0\n", " self.scale_x_y = 1\n", "\n", " self.model_out = model_out\n", "\n", " def forward(self, output, target=None):\n", " if self.training:\n", " return output\n", " masked_anchors = []\n", " for m in self.anchor_mask:\n", " masked_anchors += self.anchors[m * self.anchor_step:(m + 1) * self.anchor_step]\n", " masked_anchors = [anchor / self.stride for anchor in masked_anchors]\n", "\n", " return yolo_forward_dynamic(output, self.thresh, self.num_classes, masked_anchors, len(self.anchor_mask),scale_x_y=self.scale_x_y)\n", "\n", "\n", "def get_region_boxes(boxes_and_confs):\n", "\n", " # print('Getting boxes from boxes and confs ...')\n", "\n", " boxes_list = []\n", " confs_list = []\n", "\n", " for item in boxes_and_confs:\n", " boxes_list.append(item[0])\n", " confs_list.append(item[1])\n", "\n", " # boxes: [batch, num1 + num2 + num3, 1, 4]\n", " # confs: [batch, num1 + num2 + num3, num_classes]\n", " boxes = torch.cat(boxes_list, dim=1)\n", " confs = torch.cat(confs_list, dim=1)\n", " \n", " return boxes, confs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Download the COCO 2017 evaluation dataset and define the data loader function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!curl -LO http://images.cocodataset.org/zips/val2017.zip\n", "!curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip\n", "!unzip -q val2017.zip\n", "!unzip annotations_trainval2017.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define data loader" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import time\n", "import torchvision\n", "import torchvision.transforms as transforms\n", "import 
torchvision.datasets as dset\n", "from pycocotools.coco import COCO\n", "\n", "\n", "def get_image_filenames(root=os.getcwd()):\n", " \"\"\"\n", " Generate paths to the coco dataset image files.\n", " \n", " Args:\n", " root (str): The root folder contains.\n", " \n", " Yields:\n", " filename (str): The path to an image file.\n", " \"\"\"\n", " image_path = os.path.join(root, 'val2017')\n", " for root, dirs, files in os.walk(image_path):\n", " for filename in files:\n", " yield os.path.join(image_path, filename)\n", "\n", " \n", "def get_coco_dataloader(coco2017_root, transform, subset_indices=None):\n", " \"\"\"\n", " Create the dataset loader and ground truth coco dataset.\n", " \n", " Arguments:\n", " coco2017_root (str): The root directory to load the data/labels from.\n", " transform (torchvision.Transform): A transform to apply to the images.\n", " subset_indices (list): Indices used to create a subset of the dataset.\n", "\n", " Returns: \n", " loader (iterable): Produces transformed images and labels.\n", " cocoGt (pycocotools.coco.COCO): Contains the ground truth in coco \n", " format.\n", " label_info (dict): A mapping from label id to the human-readable name.\n", " \"\"\"\n", "\n", " # Create the dataset\n", " coco2017_img_path = os.path.join(coco2017_root, 'val2017')\n", " coco2017_ann_path = os.path.join(\n", " coco2017_root, 'annotations/instances_val2017.json')\n", "\n", " # check the number of images in val2017 - Should be 5000\n", " num_files = len(list(get_image_filenames(coco2017_root)))\n", " print('\\nNumber of images in val2017 = {}\\n'.format(num_files))\n", "\n", " # load annotations to decode classification results\n", " with open(coco2017_ann_path) as f:\n", " annotate_json = json.load(f)\n", " label_info = {label[\"id\"]: label[\"name\"]\n", " for label in annotate_json['categories']}\n", "\n", " # initialize COCO ground truth dataset\n", " cocoGt = COCO(coco2017_ann_path)\n", "\n", " # create the dataset using torchvision's coco detection dataset\n", " coco_val_data = dset.CocoDetection(\n", " root=coco2017_img_path, \n", " annFile=coco2017_ann_path, \n", " transform=transform\n", " )\n", "\n", " if subset_indices is not None:\n", " # Create a smaller subset of the data for testing - e.g. 
to pinpoint error at image 516\n", " coco_val_data = torch.utils.data.Subset(coco_val_data, subset_indices)\n", "\n", " # create the dataloader using torch dataloader\n", " loader = torch.utils.data.DataLoader(coco_val_data, batch_size=1, shuffle=False)\n", "\n", " return loader, cocoGt, label_info\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset\n", "Here 2 dataset loaders are created and the resulting data is displayed\n", "- `orig_coco_val_data_loader`: Contains the original unmodified image\n", "- `coco_val_data_loader`: Contains images of a standardized size of 608x608 pixels " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coco2017_root = './'\n", "orig_coco_val_data_loader, *_ = get_coco_dataloader(coco2017_root, transforms.ToTensor())\n", "transform = transforms.Compose([transforms.Resize([608, 608]), transforms.ToTensor()])\n", "coco_val_data_loader, cocoGt, label_info = get_coco_dataloader(coco2017_root, transform)\n", "image_orig, _ = next(iter(orig_coco_val_data_loader))\n", "print(image_orig.shape)\n", "image, image_info = next(iter(coco_val_data_loader))\n", "image_id = image_info[0][\"image_id\"].item()\n", "print(image.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define some helper functions for deployment (inference)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def postprocess(boxes, scores, score_threshold=0.05, iou_threshold=0.5):\n", " \"\"\"\n", " Classifies and filters bounding boxes from Yolo V4 output.\n", " \n", " Performs classification, filtering, and non-maximum suppression to remove\n", " boxes that are irrelevant. The result is the filtered set of boxes, the \n", " associated label confidence score, and the predicted label.\n", " \n", " See: https://pytorch.org/docs/stable/torchvision/ops.html#torchvision.ops.nms\n", " \n", " Args:\n", " boxes (torch.Tensor): The Yolo V4 bounding boxes.\n", " scores (torch.Tensor): The categories scores for each box.\n", " score_threshold (float): Ignore boxes with scores below threshold.\n", " iou_threshold (float): Discards boxes with intersection above threshold. \n", " \n", " Returns:\n", " boxes (torch.Tensor): The filtered Yolo V4 bounding boxes.\n", " scores (torch.Tensor): The label score for each box.\n", " labels (torch.Tensor): The label for each box.\n", " \"\"\"\n", " \n", " # shape: [n_batch, n_boxes, 1, 4] => [n_boxes, 4] # Assumes n_batch size is 1\n", " boxes = boxes.squeeze()\n", "\n", " # shape: [n_batch, n_boxes, 80] => [n_boxes, 80] # Assumes n_batch size is 1\n", " scores = scores.squeeze()\n", "\n", " # Classify each box according to the maximum category score\n", " score, column = torch.max(scores, dim=1)\n", "\n", " # Filter out rows for scores which are below threshold\n", " mask = score > score_threshold\n", "\n", " # Filter model output data\n", " boxes = boxes[mask]\n", " score = score[mask]\n", " idxs = column[mask]\n", "\n", " # Perform non-max suppression on all categories at once. 
shape: [n_keep,]\n", " keep = torchvision.ops.batched_nms(\n", " boxes=boxes, \n", " scores=score, \n", " idxs=idxs,\n", " iou_threshold=iou_threshold,\n", " )\n", "\n", " # The image category id associated with each column\n", " categories = torch.tensor([\n", " 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16,\n", " 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 31,\n", " 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,\n", " 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,\n", " 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 72,\n", " 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 84, 85,\n", " 86, 87, 88, 89, 90\n", " ])\n", " \n", " boxes = boxes[keep] # shape: [n_keep, 4]\n", " score = score[keep] # shape: [n_keep,]\n", " idxs = idxs[keep]\n", " label = categories[idxs] # shape: [n_keep,]\n", " \n", " return boxes, score, label\n", "\n", "\n", "def get_results_as_dict(boxes, scores, labels, image_orig):\n", " \"\"\"\n", " Transforms post-processed output into dictionary output.\n", " \n", " This translates the model coordinate bounding boxes (x1, y1, x2, y2) \n", " into a rectangular description (x, y, width, height) scaled to the \n", " original image size.\n", " \n", " Args:\n", " boxes (torch.Tensor): The Yolo V4 bounding boxes.\n", " scores (torch.Tensor): The label score for each box.\n", " labels (torch.Tensor): The label for each box.\n", " image_orig (torch.Tensor): The image to scale the bounding boxes to.\n", " \n", " Returns:\n", " output (dict): The dictionary of rectangle bounding boxes.\n", " \"\"\"\n", " h_size, w_size = image_orig.shape[-2:]\n", "\n", " x1 = boxes[:, 0] * w_size\n", " y1 = boxes[:, 1] * h_size\n", " x2 = boxes[:, 2] * w_size\n", " y2 = boxes[:, 3] * h_size\n", "\n", " width = x2 - x1\n", " height = y2 - y1\n", "\n", " boxes = torch.stack([x1, y1, width, height]).T\n", " return {\n", " 'boxes': boxes.detach().numpy(),\n", " 'labels': labels.detach().numpy(),\n", " 'scores': scores.detach().numpy(),\n", " }\n", "\n", "\n", "def prepare_for_coco_detection(predictions):\n", " \"\"\"\n", " Convert dictionary model predictions into an expected COCO dataset format.\n", " \n", " Args:\n", " predictions (dict): The list of box coordinates, scores, and labels.\n", " \n", " Returns:\n", " output (list[dict]): The list of bounding boxes.\n", " \"\"\"\n", " coco_results = []\n", " for original_id, prediction in predictions.items():\n", " if len(prediction) == 0:\n", " continue\n", "\n", " boxes = prediction[\"boxes\"].tolist()\n", " scores = prediction[\"scores\"].tolist()\n", " labels = prediction[\"labels\"].tolist()\n", "\n", " coco_results.extend(\n", " [\n", " {\n", " \"image_id\": original_id,\n", " \"category_id\": labels[k],\n", " \"bbox\": box,\n", " \"score\": scores[k],\n", " }\n", " for k, box in enumerate(boxes)\n", " ]\n", " )\n", " return coco_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download pretrained checkpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "def download_file_from_google_drive(id, destination):\n", " response = requests.post('https://drive.google.com/uc?id='+id+'&confirm=t')\n", " save_response_content(response, destination)\n", "\n", "def save_response_content(response, destination):\n", " CHUNK_SIZE = 32768\n", " with open(destination, \"wb\") as f:\n", " for chunk in response.iter_content(CHUNK_SIZE):\n", " if chunk: # filter out keep-alive new chunks\n", " f.write(chunk)" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "download_file_from_google_drive('1wv_LiFeCRYwtpkqREPeI13-gPELBDwuJ', './yolo_v4.pth')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Build, Compile, and Save Neuron-Optimized YOLO v4 TorchScript\n", "### Construct model and load pretrained checkpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model = Yolov4(yolov4conv137weight=None, n_classes=80, inference=True)\n", "weightfile = \"./yolo_v4.pth\"\n", "pretrained_dict = torch.load(weightfile, map_location=torch.device('cpu'))\n", "model.load_state_dict(pretrained_dict)\n", "model.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Execute inference for a single image and display output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import matplotlib.patches as patches\n", "\n", "image_orig, _ = next(iter(orig_coco_val_data_loader))\n", "image, _ = next(iter(coco_val_data_loader))\n", "boxes, scores = model(image)\n", "boxes, scores, labels = postprocess(boxes, scores)\n", "result_dict = get_results_as_dict(boxes, scores, labels, image_orig)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 10))\n", "ax.imshow(image_orig.numpy().squeeze(0).transpose(1, 2, 0))\n", "for xywh, _ in zip(result_dict['boxes'], result_dict['labels']):\n", " x, y, w, h = xywh\n", " rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='g', facecolor='none')\n", " ax.add_patch(rect)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Run compilation with manually specified device placement\n", "\n", "First, inspect the model without running compilation by adding the `skip_compiler=True` argument to the `torch.neuron.trace` call." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_neuron_for_inspection = torch.neuron.trace(model, image, skip_compiler=True)\n", "print(model_neuron_for_inspection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting the model, we discover that there are many `aten::slice` operations in some submodules called `YoloLayer`. Although these operations are supported by the neuron-cc compiler, they are not going to run efficiently on the Inferentia hardware. To work it around, we recommend to manually place these operators on CPU." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To manually place `YoloLayer` on CPU, we may make use of the `subgraph_builder_function` argument in `torch.neuron.trace`. It is a callback function that returns `True` or `False` based on information available in `node`. The typical use is a condition based on either `node.name` or `node.type_string`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "def subgraph_builder_function(node):\n", " return 'YoloLayer' not in node.name\n", "\n", "model_neuron = torch.neuron.trace(model, image, subgraph_builder_function=subgraph_builder_function)\n", "model_neuron.save('yolo_v4_neuron.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compilation is now finished and the compiled model has been saved to a local file called 'yolo_v4_neuron.pt'. Saving is important due to the slow compilation process." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Evaluate Accuracy on the COCO 2017 Dataset\n", "### Load compiled model and run inference\n", "To validate accuracy of the compiled model, lets run inference on the COCO 2017 validation dataset. We start by defining a helper function `run_inference`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_inference(dataloader, dataloader_orig, model, convert=True, modelName=''):\n", " \"\"\"\n", " Run Yolo V4 inference on the COCO dataset.\n", " \n", " Args:\n", " dataloader (iterable): Data loader of input processed images and labels.\n", " dataloader_orig (iterable): Data loader with original images.\n", " model (torch.nn.Module): The torch model to run inference against.\n", " convert (bool): Set to False when using a vanilla torchvision model that \n", " does not need to be transformed into coco format.\n", " \n", " Returns: \n", " imgIds (list): The list of images with predictions.\n", " cocoDt (pycocotools.coco.COCO): Contains the predictions from the model \n", " in coco format.\n", " \"\"\"\n", " print('\\n================ Starting Inference on {} Images using {} model ================\\n'.format(\n", " len(dataloader), modelName))\n", "\n", " modelName = str(modelName).replace(\" \", \"_\")\n", "\n", " # convert predicition to cocoDt\n", " # code from def evaluate in https://github.com/pytorch/vision/blob/master/references/detection/engine.py\n", " imgIds = []\n", " results = []\n", " skippedImages = []\n", "\n", " # time inference\n", " inference_time = 0.0\n", " for idx, ((image, targets), (image_orig, _)) in enumerate(zip(dataloader, dataloader_orig)):\n", " # if target is empty, skip the image because it breaks the scripted model\n", " if not targets:\n", " skippedImages.append(idx)\n", " continue\n", "\n", " # get the predictions\n", " start_time = time.time()\n", " boxes, scores = model(image)\n", " delta = time.time() - start_time\n", " inference_time += delta\n", " boxes, scores, labels = postprocess(boxes, scores)\n", " outputs = get_results_as_dict(boxes, scores, labels, image_orig)\n", "\n", " res = {target[\"image_id\"].item(): output for target,\n", " output in zip(targets, [outputs])}\n", "\n", " # add the image id to imgIds\n", " image_id = targets[0][\"image_id\"].item()\n", " imgIds.append(image_id)\n", "\n", " # convert the predicition into cocoDt results\n", " pred = prepare_for_coco_detection(res)\n", " results.extend(pred)\n", "\n", " print('\\n==================== Performance Measurement ====================')\n", " print('Finished inference on {} images in {:.2f} seconds'.format(\n", " len(dataloader), inference_time))\n", " print('=================================================================\\n')\n", "\n", " # create bbox detections file\n", " # following code in https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/tensorflow/yolo_v4_demo/evaluate.ipynb\n", " resultsfile = modelName + '_bbox_detections.json'\n", " print('Generating json file...')\n", " with open(resultsfile, 'w') as f:\n", " json.dump(results, f)\n", "\n", " # return COCO api object with loadRes\n", " cocoDt = cocoGt.loadRes(resultsfile)\n", "\n", " return imgIds, cocoDt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to simply load the compiled model from disk and then run inference." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_neuron = torch.jit.load('yolo_v4_neuron.pt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imgIds, cocoDt = run_inference(coco_val_data_loader, orig_coco_val_data_loader, model_neuron)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then use the standard `pycocotools` routines to generate a report of bounding box precision/recall." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pycocotools.cocoeval import COCOeval\n", "\n", "cocoEval = COCOeval(cocoGt, cocoDt, 'bbox')\n", "cocoEval.params.imgIds = imgIds\n", "cocoEval.evaluate()\n", "cocoEval.accumulate()\n", "cocoEval.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, we may perform the same evaluation on the CPU model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imgIdsRef, cocoDtRef = run_inference(coco_val_data_loader, orig_coco_val_data_loader, model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cocoEval = COCOeval(cocoGt, cocoDtRef, 'bbox')\n", "cocoEval.params.imgIds = imgIdsRef\n", "cocoEval.evaluate()\n", "cocoEval.accumulate()\n", "cocoEval.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Benchmark COCO Dataset Performance of the Neuron-Optimized TorchScript\n", "The following code snippet sets up data parallel on 16 NeuronCores and runs saturated multi-threaded inference on the Inferentia accelerator. Note that the number of cores (`n_cores`) should be set to the number of available NeuronCores on the current instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.neuron\n", "import torchvision\n", "import torchvision.transforms as transforms\n", "import torchvision.datasets as dset\n", "import multiprocessing as mp\n", "from concurrent.futures import ThreadPoolExecutor\n", "import PIL\n", "import os\n", "import time\n", "\n", "n_threads = 16\n", "\n", "def get_image_filenames(root=os.getcwd()):\n", " \"\"\"\n", " Generate paths to the coco dataset image files.\n", " \n", " Args:\n", " root (str): The root folder contains.\n", " \n", " Yields:\n", " filename (str): The path to an image file.\n", " \"\"\"\n", " image_path = os.path.join(root, 'val2017')\n", " for root, dirs, files in os.walk(image_path):\n", " for filename in files:\n", " yield os.path.join(image_path, filename)\n", "\n", "def preprocess(path):\n", " \"\"\"\n", " Load an image and convert to the expected Yolo V4 tensor format.\n", " \n", " Args:\n", " path (str): The image file to load from disk. \n", " \n", " Returns:\n", " result (torch.Tensor): The image for prediction. 
Shape: [1, 3, 608, 608]\n", " \"\"\"\n", " image = PIL.Image.open(path).convert('RGB')\n", " resized = torchvision.transforms.functional.resize(image, [608, 608])\n", " tensor = torchvision.transforms.functional.to_tensor(resized)\n", " return tensor.unsqueeze(0).to(torch.float32)\n", "\n", "\n", "def load_model(filename='yolo_v4_neuron.pt'):\n", " \"\"\"\n", " Load and pre-warm the Yolo V4 model.\n", " \n", " Args:\n", " filename (str): The location to load the model from.\n", " \n", " Returns:\n", " model (torch.nn.Module): The torch model.\n", " \"\"\"\n", " \n", " # Load model from disk\n", " model = torch.jit.load(filename)\n", "\n", " # Warm up model on neuron by running a single example image\n", " filename = next(iter(get_image_filenames()))\n", " image = preprocess(filename)\n", " model(image)\n", "\n", " return model\n", "\n", "\n", "def task(model, filename):\n", " \"\"\"\n", " The thread task to perform prediction.\n", " \n", " This does the full end-to-end processing of an image from loading from disk\n", " all the way to classifying and filtering bounding boxes.\n", " \n", " Args:\n", " model (torch.nn.Module): The model to run processing with\n", " filename (str): The image file to load from disk. \n", " \n", " Returns:\n", " boxes (torch.Tensor): The Yolo V4 bounding boxes.\n", " scores (torch.Tensor): The label score for each box.\n", " labels (torch.Tensor): The label for each box. \n", " \"\"\"\n", " image = preprocess(filename)\n", " begin = time.time()\n", " boxes, scores = model(image)\n", " delta = time.time() - begin\n", " return postprocess(boxes, scores), delta\n", "\n", "\n", "def benchmark():\n", " \"\"\"\n", " Run a benchmark on the entire COCO dataset against the neuron model.\n", " \"\"\"\n", " \n", " # Load a model into each NeuronCore\n", " models = [load_model() for _ in range(n_cores)]\n", " \n", " # Create input/output lists\n", " filenames = list(get_image_filenames())\n", " results = list()\n", " latency = list()\n", " \n", " # We want to keep track of average completion time per thread\n", " sum_time = 0.0\n", " \n", " # Submit all tasks and wait for them to finish\n", " with ThreadPoolExecutor(n_threads) as pool:\n", " for i, filename in enumerate(filenames):\n", " result = pool.submit(task, models[i % len(models)], filename)\n", " results.append(result)\n", " for result in results:\n", " results, times = result.result() # Note: Outputs unused for benchmark\n", " latency.append(times)\n", " sum_time += times\n", " \n", " print('Duration: ', sum_time / n_threads)\n", " print('Images Per Second:', len(filenames) / (sum_time / n_threads))\n", " print(\"Latency P50: {:.1f}\".format(np.percentile(latency[1000:], 50)*1000.0))\n", " print(\"Latency P90: {:.1f}\".format(np.percentile(latency[1000:], 90)*1000.0))\n", " print(\"Latency P95: {:.1f}\".format(np.percentile(latency[1000:], 95)*1000.0))\n", " print(\"Latency P99: {:.1f}\".format(np.percentile(latency[1000:], 99)*1000.0))\n", "\n", "benchmark()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: 
src/examples/tensorflow/bert_demo/LICENSE ================================================ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: src/examples/tensorflow/bert_demo/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)** ================================================ FILE: src/examples/tensorflow/bert_demo/bert_client.py ================================================ # coding=utf-8 """ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0 Program to gather information from a system """ import sys import os import argparse import random import time import grpc import mrpc_pb2 sys.path.append(os.path.dirname(__file__)) import mrpc_pb2_grpc import mrpc_feature latencies = [] def client(): parser = argparse.ArgumentParser() parser.add_argument('--port', default=60061, help='gRPC port') parser.add_argument('--pair', default=None, help='Text pair') parser.add_argument('--cycle', type=int, default=1, help='Number of inference cycles') parser.add_argument('--save-accuracy', default=None, help='Save accuracy to file') args = parser.parse_args() text_pair = mrpc_pb2.TextPair() if args.pair is not None: text_a, text_b = args.pair text_pair.text_a = text_a.encode() text_pair.text_b = text_b.encode() else: eval_data_path = os.path.join(os.path.dirname(__file__), 'glue_mrpc_dev.tsv') tsv = mrpc_feature.read_tsv(eval_data_path) with grpc.insecure_channel('127.0.0.1:{}'.format(args.port)) as channel: stub = mrpc_pb2_grpc.mrpcStub(channel) num_correct = 0 very_start = time.time() for _ in range(args.cycle): if args.pair is None: data = random.choice(tsv[1:]) text_pair.text_a = data[3].encode() text_pair.text_b = data[4].encode() start = time.time() yes_no = stub.paraphrase(text_pair) elapsed = time.time() - start if data is None: evaluation = '' else: if yes_no.prediction.decode() == data[0]: num_correct += 1 evaluation = 'correct, ' if yes_no.prediction.decode() == data[0] else 'incorrect, ' print('{} ({}latency {} s)'.format(yes_no.message.decode(), evaluation, elapsed)) latencies.append(elapsed) if args.cycle > 1: accuracy = num_correct / args.cycle print('took {} s for {} cycles, accuracy {}'.format(time.time() - very_start, args.cycle, accuracy)) if args.save_accuracy is not None: with open(args.save_accuracy, 'w') as f: f.write(str(accuracy)) def write_latencies(): with open('latencies.txt', 'a') as f: for l in latencies: f.write(str(l) + '\n') if __name__ == '__main__': client() write_latencies() ================================================ FILE: src/examples/tensorflow/bert_demo/bert_model.py ================================================ # coding=utf-8 """ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
================================================
FILE: src/examples/tensorflow/bert_demo/bert_model.py
================================================
# coding=utf-8
"""
Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

Rewrite a BERT MRPC SavedModel into a form that compiles into a single
NeuronOp and runs on Inferentia.
"""
import os
import argparse
import shlex
import numpy as np
import tensorflow as tf
from tensorflow.neuron import fuse
from tensorflow.core.framework import attr_value_pb2


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_saved_model', required=True, help='Original SavedModel')
    parser.add_argument('--output_saved_model', required=True, help='Output SavedModel that runs on Inferentia')
    parser.add_argument('--dtype', default='float16', help='Data type for weights')
    parser.add_argument('--batch_size', type=int, default=4)
    parser.add_argument('--sequence_length', type=int, default=128)
    parser.add_argument('--crude_gelu', action='store_true')
    parser.add_argument('--aggressive_optimizations', action='store_true')
    args = parser.parse_args()
    if os.path.exists(args.output_saved_model):
        raise OSError('output_saved_model {} already exists'.format(args.output_saved_model))
    dtype = tf.float16 if args.dtype == 'float16' else tf.float32
    if args.aggressive_optimizations:
        args.crude_gelu = True
    bert = NeuronBERTMRPC(
        args.input_saved_model,
        dtype=dtype,
        batch_size=args.batch_size,
        seq_len=args.sequence_length,
        crude_gelu=args.crude_gelu,
        aggressive_fp16_cast=args.aggressive_optimizations,
    )
    fuser = fuse(compiler_args=['--fp32-cast', 'matmult'], timeout=360000)
    bert.encoder = fuser(bert.encoder)
    input_ids = bert.input_ids
    input_mask = bert.input_mask
    segment_ids = bert.segment_ids
    with tf.Session(graph=tf.Graph()) as sess:
        input_ids_ph_shape = input_ids.shape.as_list()
        input_ids_ph_shape[0] = None
        input_ids_ph = tf.placeholder(input_ids.dtype, input_ids_ph_shape, name='input_ids')
        input_mask_ph_shape = input_mask.shape.as_list()
        input_mask_ph_shape[0] = None
        input_mask_ph = tf.placeholder(input_mask.dtype, input_mask_ph_shape, name='input_mask')
        segment_ids_ph_shape = segment_ids.shape.as_list()
        segment_ids_ph_shape[0] = None
        segment_ids_ph = tf.placeholder(segment_ids.dtype, segment_ids_ph_shape, name='segment_ids')
        dummy_reshapes = []
        discard_op_names = set()
        with tf.name_scope('bert/embeddings'):
            expand_dims = tf.expand_dims(input_ids_ph, axis=-1)
            batch_size = tf.shape(input_ids_ph)[0]
            reshape = tf.reshape(expand_dims, [batch_size * bert.seq_len])
            gatherv2 = tf.gather(bert.weights_dict['bert/embeddings/word_embeddings:0'], reshape, axis=0)
            reshape_1 = tf.reshape(gatherv2, [batch_size, bert.seq_len, bert.hid_size])
            reshape_2 = tf.reshape(segment_ids_ph, [batch_size * bert.seq_len])
            one_hot = tf.one_hot(reshape_2, depth=2)
            matmul = tf.matmul(one_hot, bert.weights_dict['bert/embeddings/token_type_embeddings:0'])
            reshape_3 = tf.reshape(matmul, [batch_size, bert.seq_len, bert.hid_size])
            slice0 = tf.slice(bert.weights_dict['bert/embeddings/position_embeddings:0'],
                              begin=[0, 0], size=[bert.seq_len, -1])
            add_1 = reshape_1 + reshape_3 + slice0
            input_tensor = tf.reshape(add_1, [batch_size, bert.seq_len, bert.hid_size])
        with tf.name_scope('bert/encoder'):
            reshape = tf.reshape(input_mask_ph, [batch_size, 1, 1, bert.seq_len])
            bias_tensor = tf.cast(reshape, tf.float32)
            bias_tensor = 1.0 - bias_tensor
            bias_tensor = bias_tensor * -10000.0
            bias_tensor = tf.cast(bias_tensor, bert.dtype)
            tensor = bert.layer_norm(input_tensor, 'embeddings', force_float32=True)
            # Reshape to the static batch size the fused encoder expects; these
            # dummy reshapes are bypassed and discarded from the graph below.
            tensor = tf.reshape(tensor, [bert.batch_size, bert.seq_len, bert.hid_size])
            dummy_reshapes.append(tensor)
            discard_op_names.add(tensor.op.name)
            bias_tensor = tf.reshape(bias_tensor, [bert.batch_size, 1, 1, bert.seq_len])
            dummy_reshapes.append(bias_tensor)
            discard_op_names.add(bias_tensor.op.name)
        logits = bert.encoder(tensor, bias_tensor)
        with tf.name_scope('loss'):
            if bert.dtype is not tf.float32:
                logits = tf.cast(logits, tf.float32)
            probabilities = tf.nn.softmax(logits)
        # Rewire each consumer of a dummy reshape to the reshape's input so the
        # NeuronOp consumes the dynamic-batch tensors directly.
        for rts in dummy_reshapes:
            neuron_op = rts.consumers()[0]
            neuron_op._update_input(list(neuron_op.inputs).index(rts), rts.op.inputs[0])
        try:
            sess.run(probabilities)
        except Exception:
            pass
        graph_def = sess.graph.as_graph_def()
        new_graph_def = tf.GraphDef()
        new_graph_def.node.MergeFrom(node for node in graph_def.node if node.name not in discard_op_names)
        neuron_op_node = [node for node in new_graph_def.node if node.op == 'NeuronOp'][0]
        neuron_op_node.attr['input_batch_axis'].list.i[:] = [0, 0]
        neuron_op_node.attr['output_batch_axis'].list.i[:] = [0]
    with tf.Session(graph=tf.Graph()) as sess:
        tf.import_graph_def(new_graph_def, name='')
        inputs = {
            'input_ids': sess.graph.get_tensor_by_name(input_ids_ph.name),
            'input_mask': sess.graph.get_tensor_by_name(input_mask_ph.name),
            'segment_ids': sess.graph.get_tensor_by_name(segment_ids_ph.name),
        }
        outputs = {'probabilities': sess.graph.get_tensor_by_name(probabilities.name)}
        try:
            sess.run(probabilities)
        except Exception:
            pass
        neuron_op = [op for op in sess.graph.get_operations() if op.type == 'NeuronOp'][0]
        if not neuron_op.get_attr('executable'):
            raise AttributeError('Neuron executable (neff) is empty. Please check neuron-cc is installed '
                                 'and working properly (`pip install neuron-cc` to install neuron-cc).')
        tf.saved_model.simple_save(sess, args.output_saved_model, inputs, outputs)


class NeuronBERTMRPC:

    def __init__(self, bert_saved_model, dtype=tf.float16, batch_size=4, seq_len=128,
                 crude_gelu=False, aggressive_fp16_cast=False):
        predictor = tf.contrib.predictor.from_saved_model(bert_saved_model)
        sess = predictor.session
        self.input_ids = predictor.feed_tensors['input_ids']
        self.input_mask = predictor.feed_tensors['input_mask']
        self.segment_ids = predictor.feed_tensors['segment_ids']
        # Pull all constants and variable-read tensors out of the graph so the
        # encoder can be rebuilt with numpy weights at the desired precision.
        weights_dict = {}
        for op in sess.graph.get_operations():
            if op.type == 'Const':
                tensor = op.outputs[0]
                weights_dict[tensor.name] = tensor
            if op.type == 'Identity' and op.name.endswith('read'):
                tensor = op.outputs[0]
                weights_dict[tensor.op.inputs[0].name] = tensor
        self.weights_dict = sess.run(weights_dict)
        self.dtype = dtype
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.hid_size, self.inter_size = self.weights_dict['bert/encoder/layer_0/intermediate/dense/kernel:0'].shape
        self.num_heads = sess.graph.get_tensor_by_name('bert/encoder/layer_0/attention/self/Reshape:0').shape.as_list()[2]
        self.head_size = self.hid_size // self.num_heads
        self.eps = self.weights_dict['bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add/y:0']
        self.crude_gelu = crude_gelu
        self.layer_norm_dtype = tf.float16 if aggressive_fp16_cast else tf.float32
        sess.close()

    def encoder(self, tensor, bias_tensor):
        tensor = tf.reshape(tensor, [self.batch_size * self.seq_len, self.hid_size])
        for layer_id in range(24):
            mid_layer_name = 'layer_{}'.format(layer_id)
            tensor = self.self_attention(tensor, bias_tensor, mid_layer_name)
            tensor = self.layer_norm(tensor, 'encoder/' + mid_layer_name + '/attention/output')
            tensor = self.fully_connected(tensor, mid_layer_name)
            tensor = self.layer_norm(tensor, 'encoder/' + mid_layer_name + '/output')
        logits = self.pooler_loss(tensor)
        return logits

    def fully_connected(self, input_tensor, layer_name):
        inter_kernel = self.weights_dict['bert/encoder/{}/intermediate/dense/kernel:0'.format(layer_name)]
        inter_bias = self.weights_dict['bert/encoder/{}/intermediate/dense/bias:0'.format(layer_name)]
        out_kernel = self.weights_dict['bert/encoder/{}/output/dense/kernel:0'.format(layer_name)]
        out_bias = self.weights_dict['bert/encoder/{}/output/dense/bias:0'.format(layer_name)]
        with tf.name_scope('bert/encoder/{}/fully_connected/intermediate/dense'.format(layer_name)):
            matmul = tf.matmul(input_tensor, inter_kernel.astype(self.dtype.as_numpy_dtype))
            bias_add = tf.nn.bias_add(matmul, inter_bias.astype(self.dtype.as_numpy_dtype))
            gelu = self.gelu_sigmoid(bias_add) if self.crude_gelu else self.gelu_tanh(bias_add)
        with tf.name_scope('bert/encoder/{}/fully_connected/output/dense'.format(layer_name)):
            matmul = tf.matmul(gelu, out_kernel.astype(self.dtype.as_numpy_dtype))
            bias_add = tf.nn.bias_add(matmul, out_bias.astype(self.dtype.as_numpy_dtype))
            output_tensor = bias_add + input_tensor
        return output_tensor

    def self_attention(self, input_tensor, bias_tensor, layer_name):
        # Fold the 1/sqrt(head_size) attention scaling into the query weights.
        query_kernel = self.weights_dict['bert/encoder/{}/attention/self/query/kernel:0'.format(layer_name)] * 0.125
        query_bias = self.weights_dict['bert/encoder/{}/attention/self/query/bias:0'.format(layer_name)] * 0.125
        key_kernel = self.weights_dict['bert/encoder/{}/attention/self/key/kernel:0'.format(layer_name)]
        key_bias = self.weights_dict['bert/encoder/{}/attention/self/key/bias:0'.format(layer_name)]
        value_kernel = self.weights_dict['bert/encoder/{}/attention/self/value/kernel:0'.format(layer_name)]
        value_bias = self.weights_dict['bert/encoder/{}/attention/self/value/bias:0'.format(layer_name)]
        output_kernel = self.weights_dict['bert/encoder/{}/attention/output/dense/kernel:0'.format(layer_name)]
        output_bias = self.weights_dict['bert/encoder/{}/attention/output/dense/bias:0'.format(layer_name)]
        with tf.name_scope('bert/encoder/{}/attention/self'.format(layer_name)):
            matmul = tf.matmul(input_tensor, query_kernel.astype(self.dtype.as_numpy_dtype))
            query = tf.nn.bias_add(matmul, query_bias.astype(self.dtype.as_numpy_dtype))
            query_r = tf.reshape(query, [self.batch_size, self.seq_len, self.num_heads, self.head_size])
            query_rt = tf.transpose(query_r, [0, 2, 1, 3])
            matmul = tf.matmul(input_tensor, key_kernel.astype(self.dtype.as_numpy_dtype))
            key = tf.nn.bias_add(matmul, key_bias.astype(self.dtype.as_numpy_dtype))
            key_r = tf.reshape(key, [self.batch_size, self.seq_len, self.num_heads, self.head_size])
            key_rt = tf.transpose(key_r, [0, 2, 1, 3])  # [b, n, l, h]
            query_key = tf.matmul(query_rt, key_rt, transpose_b=True)  # [b, n, lq, h] @ [b, n, lk, h] -> [b, n, lq, lk]
            bias_query_key = tf.add(query_key, bias_tensor)
            softmax_weights = tf.nn.softmax(bias_query_key)
            matmul = tf.matmul(input_tensor, value_kernel.astype(self.dtype.as_numpy_dtype))
            value = tf.nn.bias_add(matmul, value_bias.astype(self.dtype.as_numpy_dtype))
            value_r = tf.reshape(value, [self.batch_size, self.seq_len, self.num_heads, self.head_size])
            value_rt = tf.transpose(value_r, [0, 2, 3, 1])
            weighted_value_rt = tf.matmul(softmax_weights, value_rt, transpose_b=True)  # [b, n, lq, lk] @ [b, n, h, lv] -> [b, n, lq, h]
            weighted_value_r = tf.transpose(weighted_value_rt, [0, 2, 1, 3])  # [b, lq, n, h]
            weighted_value = tf.reshape(weighted_value_r, [self.batch_size * self.seq_len, self.hid_size])
        with tf.name_scope('bert/encoder/{}/attention/output'.format(layer_name)):
            matmul = tf.matmul(weighted_value, output_kernel.astype(self.dtype.as_numpy_dtype))
            unnorm_output = tf.nn.bias_add(matmul, output_bias.astype(self.dtype.as_numpy_dtype))
            output_tensor = tf.add(input_tensor, unnorm_output)
        return output_tensor

    def layer_norm(self, input_tensor, layer_name, force_float32=False):
        dtype = tf.float32 if force_float32 else self.layer_norm_dtype
        gamma = dtype.as_numpy_dtype(self.weights_dict['bert/{}/LayerNorm/gamma:0'.format(layer_name)])
        beta = dtype.as_numpy_dtype(self.weights_dict['bert/{}/LayerNorm/beta:0'.format(layer_name)])
        with tf.name_scope('bert/{}/LayerNorm'.format(layer_name)):
            input_tensor = tf.cast(input_tensor, dtype)
            mean = tf.reduce_mean(input_tensor, axis=[-1], keepdims=True, name='mean')
            residuals = tf.subtract(input_tensor, mean, name='residuals')
            var = tf.reduce_mean(residuals * residuals, axis=[-1], keepdims=True, name='var')
            rsqrt = tf.rsqrt(var + dtype.as_numpy_dtype(self.eps))
            norm_output = tf.multiply(residuals, rsqrt, name='normalized')
            output_tensor = norm_output * gamma + beta
            output_tensor = tf.cast(output_tensor, self.dtype)
        return output_tensor

    def pooler_loss(self, input_tensor):
        pooler_kernel = self.weights_dict['bert/pooler/dense/kernel:0']
        pooler_bias = self.weights_dict['bert/pooler/dense/bias:0']
        loss_kernel = self.weights_dict['output_weights:0'].T
        loss_bias = self.weights_dict['output_bias:0']
        with tf.name_scope('bert/pooler_loss'):
            reshape = tf.reshape(input_tensor, [self.batch_size, self.seq_len, self.hid_size])
            # Pool by taking the [CLS] token (position 0) of each sequence.
            reshape_1 = tf.reshape(reshape[:, 0:1, :], [self.batch_size, self.hid_size])
            matmul = tf.matmul(reshape_1, pooler_kernel.astype(self.dtype.as_numpy_dtype))
            bias_add = tf.nn.bias_add(matmul, pooler_bias.astype(self.dtype.as_numpy_dtype))
            tanh = tf.tanh(bias_add)
            matmul = tf.matmul(tanh, loss_kernel.astype(self.dtype.as_numpy_dtype))
            output_tensor = tf.nn.bias_add(matmul, loss_bias.astype(self.dtype.as_numpy_dtype))
        return output_tensor

    def gelu_tanh(self, tensor):
        pow3 = 0.044714998453855515 * tensor * tensor * tensor + tensor
        shifted = (tf.tanh(0.7978845834732056 * pow3) + 1.0) * tensor
        return tf.multiply(shifted, 0.5)

    def gelu_sigmoid(self, tensor):
        return tf.sigmoid(1.702 * tensor) * tensor


if __name__ == '__main__':
    main()
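The two GELU variants at the end of bert_model.py trade accuracy for cheaper compute (the sigmoid form is what `--crude_gelu` selects). For reference, a standalone sketch (not part of the repo; helper names are my own) comparing both approximations against the exact erf-based GELU:

import math
import numpy as np

def gelu_exact(x):
    # Exact definition: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation, same constants as bert_model.py (sqrt(2/pi) ~ 0.79788)
    return 0.5 * x * (1.0 + math.tanh(0.7978845834732056 * (x + 0.044714998453855515 * x ** 3)))

def gelu_sigmoid(x):
    # "crude" sigmoid approximation: sigmoid(1.702 * x) * x
    return x / (1.0 + math.exp(-1.702 * x))

for x in np.linspace(-3.0, 3.0, 7):
    print('{:+.1f}  exact={:+.4f}  tanh={:+.4f}  sigmoid={:+.4f}'.format(
        x, gelu_exact(x), gelu_tanh(x), gelu_sigmoid(x)))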
================================================
FILE: src/examples/tensorflow/bert_demo/bert_model_server.py
================================================
# coding=utf-8
"""
Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

Launch tensorflow_model_server processes that serve the compiled BERT SavedModel.
"""
import os
import argparse
import subprocess
import time

_ONE_DAY_IN_SECONDS = 60 * 60 * 24


def serve():
    parser = argparse.ArgumentParser()
    parser.add_argument('--serving', required=True, help='Path to tf-serving binary')
    parser.add_argument('--dir', required=True, help='TensorFlow SavedModel dir')
    parser.add_argument('--port', default=8500, help='gRPC port')
    parser.add_argument('--parallel', type=int, default=8, help='Number of predictors')
    args = parser.parse_args()
    model = os.path.abspath(args.dir)
    # tf-serving expects a version subdirectory; symlink the SavedModel into one.
    model_with_version = os.path.join(model, '1')
    if not os.path.exists(model_with_version):
        os.makedirs(model_with_version)
        os.symlink(os.path.join(model, 'variables'), os.path.join(model_with_version, 'variables'))
        os.symlink(os.path.join(model, 'saved_model.pb'), os.path.join(model_with_version, 'saved_model.pb'))
    process_list = []
    for _ in range(args.parallel):
        proc = subprocess.Popen([
            args.serving,
            '--model_base_path={}'.format(model),
            '--port={}'.format(args.port),
            '--tensorflow_intra_op_parallelism=1',
            '--tensorflow_inter_op_parallelism=1',
        ])
        process_list.append(proc)
    try:
        time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        for proc in process_list:
            proc.terminate()
            proc.wait()


if __name__ == '__main__':
    serve()

================================================
FILE: src/examples/tensorflow/bert_demo/bert_no_model.py
================================================
# bert_no_model.py
import argparse
import tensorflow as tf
import tensorflow.neuron as tfn


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_saved_model', required=True, help='Original SavedModel')
    parser.add_argument('--output_saved_model', required=True, help='Output SavedModel that runs on Inferentia')
    parser.add_argument('--batch_size', type=int, default=1)
    args = parser.parse_args()
    pred = tf.contrib.predictor.from_saved_model(args.input_saved_model)
    no_fuse_ops = [op.name for op in pred.graph.get_operations()]

    def force_fuse_condition(op_name):
        exclude_scopes = [
            'bert/encoder/strided_slice',
            'bert/encoder/ones',
            'bert/encoder/Reshape',
            'bert/encoder/Shape',
            'bert/encoder/Cast',
        ]
        for scope in exclude_scopes:
            if op_name == scope or op_name.startswith('{}/'.format(scope)):
                return False
        return op_name.startswith('bert/encoder') or op_name.startswith('bert/pooler')

    force_fuse_ops = [op.name for op in pred.graph.get_operations() if force_fuse_condition(op.name)]
    compilation_result = tfn.saved_model.compile(
        args.input_saved_model,
        args.output_saved_model,
        batch_size=args.batch_size,
        no_fuse_ops=no_fuse_ops,
        force_fuse_ops=force_fuse_ops,
    )
    print(compilation_result)


if __name__ == '__main__':
    main()
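After running bert_no_model.py, a quick way to sanity-check the result is to count NeuronOp nodes in the output SavedModel, mirroring the check bert_model.py performs. A hedged sketch (assumes TF 1.x with tensorflow-neuron; the SavedModel path is a placeholder):

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # 'bert_saved_model_neuron' is an illustrative path for the compiled model.
    tf.saved_model.loader.load(sess, ['serve'], 'bert_saved_model_neuron')
    ops = sess.graph.get_operations()
    neuron_ops = [op for op in ops if op.type == 'NeuronOp']
    print('{} NeuronOp node(s) out of {} ops total'.format(len(neuron_ops), len(ops)))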
================================================
FILE: src/examples/tensorflow/bert_demo/bert_server.py
================================================
# coding=utf-8
"""
Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

gRPC server that batches MRPC paraphrase requests and runs them through a
compiled BERT SavedModel on Inferentia.
"""
import sys
import os
import collections
import argparse
import time
import csv
import random
from concurrent import futures
import multiprocessing
from multiprocessing.dummy import Pool
from threading import Lock
import pkg_resources
from distutils.version import LooseVersion
import grpc
import numpy as np
import tensorflow as tf

# Make the generated gRPC modules and local helpers importable.
sys.path.append(os.path.dirname(__file__))
import mrpc_feature
import tokenization
import mrpc_pb2
import mrpc_pb2_grpc

_ONE_DAY_IN_SECONDS = 60 * 60 * 24
total_tpt = 0
num_tpt = 0


class BERTService(mrpc_pb2_grpc.mrpcServicer):

    def __init__(self, model_path, parallel, batch_size, bootstrap, vocab_txt, num_thread_per_predictor=2):
        num_queues = parallel * num_thread_per_predictor
        config = tf.ConfigProto(inter_op_parallelism_threads=num_queues, intra_op_parallelism_threads=1)
        tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version)
        if tfn_version >= LooseVersion('1.15.0.1.0.1333.0'):
            # Newer tensorflow-neuron can share one predictor across queues.
            neuroncore_group_sizes = '{}x1'.format(parallel)
            predictor = tf.contrib.predictor.from_saved_model(model_path, config=config)
            self.predictor_list = [predictor for _ in range(num_queues)]
        else:
            neuroncore_group_sizes = ','.join('1' for _ in range(parallel))
            predictor_list = [tf.contrib.predictor.from_saved_model(model_path, config=config)
                              for _ in range(parallel)]
            self.predictor_list = []
            for pred in predictor_list:
                self.predictor_list.extend(pred for _ in range(num_thread_per_predictor))
        os.environ['NEURONCORE_GROUP_SIZES'] = neuroncore_group_sizes
        if self.predictor_list[0].feed_tensors['input_ids'].shape.is_fully_defined():
            self.batch_size = self.predictor_list[0].feed_tensors['input_ids'].shape.as_list()[0]
        else:
            self.batch_size = batch_size
        self.bootstrap = bootstrap
        self.tokenizer = tokenization.FullTokenizer(vocab_file=vocab_txt, do_lower_case=True)
        self.num_infer = 0
        self.num_correct = 0
        self.output_name = list(self.predictor_list[0].fetch_tensors.keys())[0]
        self.iid = 0
        self.throughput_list = []
        self.latency_list = []
        self.max_len_latency_list = 1000
        self.iid_lock = Lock()
        if bootstrap:
            # Pre-build batches from the dev set so the server can drive itself.
            self.request_queue_list = [collections.deque() for _ in self.predictor_list]
            eval_data_path = os.path.join(os.path.dirname(__file__), 'glue_mrpc_dev.tsv')
            tsv = mrpc_feature.read_tsv(eval_data_path)
            for request_queue in self.request_queue_list:
                for _ in range(1024):
                    data_list = random.choices(tsv[1:], k=self.batch_size)
                    model_feed_dict_list = [
                        mrpc_feature.text_pair_to_model_feed_dict(data[3], data[4], self.tokenizer)
                        for data in data_list
                    ]
                    label_list = [int(data[0]) for data in data_list]
                    batch_labels = np.array(label_list)
                    batch_feeds = {
                        key: np.concatenate([feed[key] for feed in model_feed_dict_list], axis=0)
                        for key in model_feed_dict_list[0].keys()
                    }
                    request_queue.append((batch_feeds, batch_labels))
        else:
            self.request_queue_list = [[] for _ in self.predictor_list]
        self.result_map = {}
        self.alive = True
        dummy_feed = {
            'input_ids': np.zeros([1, 128], dtype=np.int32),
            'input_mask': np.zeros([1, 128], dtype=np.int32),
            'segment_ids': np.zeros([1, 128], dtype=np.int32),
        }
        self.dummy_feeds = [(None, dummy_feed) for _ in range(self.batch_size)]
        # Warm up every predictor once so initialization cost is paid up front.
        model_feed_dict_list = [dummy_feed for _ in range(self.batch_size)]
        batch_feeds = {
            key: np.concatenate([feed[key] for feed in model_feed_dict_list], axis=0)
            for key in model_feed_dict_list[0].keys()
        }
        pool = Pool(len(self.predictor_list))
        for pred in self.predictor_list:
            pool.apply_async(pred, (batch_feeds,))
        time.sleep(1)
        pool.close()
        pool.join()

    def cleanup(self):
        for pred in self.predictor_list:
            print(pred)
            pred.session.close()

    def current_throughput(self):
        global total_tpt
        global num_tpt
        last_num_infer = self.num_infer
        while self.alive:
            current_num_infer = self.num_infer
            throughput = current_num_infer - last_num_infer
            self.throughput_list.append(throughput)
            print('current throughput {}'.format(throughput))
            last_num_infer = current_num_infer
            if throughput != 0:
                total_tpt += throughput
                num_tpt += 1
            time.sleep(1)

    def current_throughput_accuracy(self):
        global total_tpt
        global num_tpt
        last_num_infer = self.num_infer
        while self.alive:
            current_num_infer = self.num_infer
            throughput = current_num_infer - last_num_infer
            accuracy = 0.0 if self.num_infer == 0 else self.num_correct / self.num_infer
            print('current throughput {}, accuracy {}'.format(throughput, accuracy))
            last_num_infer = current_num_infer
            if throughput != 0:
                total_tpt += throughput
                num_tpt += 1
            time.sleep(1)

    def paraphrase(self, text_pair, context):
        iid = self.put_input(text_pair.text_a, text_pair.text_b)
        yes_no = mrpc_pb2.YesNo()
        if self.get_output(iid) == 1:
            yes_no.message = b'paraphrase!'
            yes_no.prediction = b'1'
        else:
            yes_no.message = b'not paraphrase!'
            yes_no.prediction = b'0'
        return yes_no

    def put_input(self, text_a, text_b):
        model_feed_dict = mrpc_feature.text_pair_to_model_feed_dict(text_a, text_b, self.tokenizer)
        with self.iid_lock:
            self.iid += 1
            iid = self.iid
        self.request_queue_list[iid % len(self.request_queue_list)].append((iid, model_feed_dict))
        return iid

    def process_input(self, idx):
        print('input processor is waiting')
        request_queue = self.request_queue_list[idx]
        predictor = self.predictor_list[idx]
        while self.alive:
            if len(request_queue) > 0:
                sublist = request_queue[:self.batch_size]
                request_queue[:self.batch_size] = []
                if len(sublist) < self.batch_size:
                    # Pad the partial batch with dummy feeds to the fixed batch size.
                    pad_batch_size = self.batch_size - len(sublist)
                    print('batch with {} garbage entries!'.format(pad_batch_size))
                    sublist.extend(self.dummy_feeds[:pad_batch_size])
                iid_list = [iid for iid, _ in sublist]
                model_feed_dict_list = [feed for _, feed in sublist]
                batch_feeds = {
                    key: np.concatenate([feed[key] for feed in model_feed_dict_list], axis=0)
                    for key in model_feed_dict_list[0].keys()
                }
                start = time.time()
                batch_predictions = predictor(batch_feeds)[self.output_name].argmax(-1)
                latency = time.time() - start
                if len(self.latency_list) < self.max_len_latency_list:
                    self.latency_list.append(latency)
                self.result_map.update({iid: pred for iid, pred in zip(iid_list, batch_predictions)})
            time.sleep(0.001)

    def process_input_bootstrap(self, idx):
        print('input processor is waiting')
        request_queue = self.request_queue_list[idx]
        predictor = self.predictor_list[idx]
        while self.alive:
            if len(request_queue) > 0:
                batch_feeds, batch_labels = request_queue.popleft()
                batch_predictions = predictor(batch_feeds)[self.output_name].argmax(-1)
                self.num_infer += self.batch_size
                self.num_correct += (batch_predictions == batch_labels).sum()
                continue
            time.sleep(0.0001)

    def get_output(self, iid):
        while iid not in self.result_map:
            time.sleep(0.001)
        self.num_infer += 1
        return self.result_map.pop(iid)


def serve():
    parser = argparse.ArgumentParser()
    parser.add_argument('--port', default=60061, help='gRPC port')
    parser.add_argument('--dir', required=True, help='TensorFlow SavedModel dir')
    parser.add_argument('--parallel', type=int, default=4, help='Number of predictors')
    parser.add_argument('--thread', type=int, default=2, help='Number of threads used by each predictor')
    parser.add_argument('--batch', type=int, default=4, help='Batch size')
    parser.add_argument('--bootstrap', action='store_true', help='Server loads a dataset and runs inference itself')
    args = parser.parse_args()
    vocab_txt = os.path.join(os.path.dirname(__file__), 'uncased_L-24_H-1024_A-16.vocab.txt')
    bert_service = BERTService(args.dir, args.parallel, args.batch, args.bootstrap, vocab_txt, args.thread)
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=128),
        options=[('grpc.max_send_message_length', -1), ('grpc.max_receive_message_length', -1)])
    mrpc_pb2_grpc.add_mrpcServicer_to_server(bert_service, server)
    server.add_insecure_port('[::]:{}'.format(args.port))
    server.start()
    try:
        pool = Pool(len(bert_service.predictor_list) + 1)  # +1 for the throughput monitor
        if args.bootstrap:
            monitor_func = bert_service.current_throughput_accuracy
            process_func = bert_service.process_input_bootstrap
        else:
            monitor_func = bert_service.current_throughput
            process_func = bert_service.process_input
        pool.apply_async(monitor_func)
        if args.parallel == 1:
            process_func(0)
        else:
            for idx in range(len(bert_service.predictor_list)):
                pool.apply_async(process_func, (idx,))
        pool.close()
        time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        pass
    bert_service.alive = False
    bert_service.cleanup()
    server.stop(0)


if __name__ == '__main__':
    serve()
    if num_tpt:
        print(f'Average Throughput: {total_tpt / num_tpt}')
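The core batching pattern in bert_server.py is: pad a partial batch with dummy feeds up to the fixed batch size, concatenate the per-key arrays, then keep only the predictions for real requests. A standalone sketch of that pattern (names and sizes are illustrative, not from the demo):

import numpy as np

BATCH_SIZE = 4
dummy_feed = {'input_ids': np.zeros([1, 128], dtype=np.int32)}

def make_batch(feeds):
    # Pad to the fixed batch size the compiled model expects.
    n_pad = BATCH_SIZE - len(feeds)
    feeds = feeds + [dummy_feed] * n_pad
    batch = {key: np.concatenate([f[key] for f in feeds], axis=0) for key in feeds[0]}
    return batch, n_pad

batch, n_pad = make_batch([dummy_feed, dummy_feed])  # two real requests
print(batch['input_ids'].shape, 'padded entries:', n_pad)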
================================================
FILE: src/examples/tensorflow/bert_demo/download_mrpc_data.py
================================================
import os
import sys
import argparse
import urllib.error
import urllib.request

MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'


# This function is taken from https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e.
def format_mrpc(data_dir, path_to_data, path_to_dev_tsv):
    print("Processing MRPC...")
    mrpc_dir = os.path.join(data_dir, "MRPC")
    if not os.path.isdir(mrpc_dir):
        os.mkdir(mrpc_dir)
    if path_to_data:
        mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt")
    else:
        try:
            mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt")
            mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt")
            urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file)
            urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file)
        except urllib.error.HTTPError:
            print("Error downloading MRPC")
            return
    assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file
    assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file

    with open(mrpc_test_file, encoding='utf-8') as data_fh, \
            open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding='utf-8') as test_fh:
        header = data_fh.readline()
        test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
        for idx, row in enumerate(data_fh):
            label, id1, id2, s1, s2 = row.strip().split('\t')
            test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2))

    dev_ids = []
    with open(path_to_dev_tsv, encoding='utf-8') as dev_fh:
        header = dev_fh.readline()
        for row in dev_fh:
            _, id1, id2, _, _ = row.strip().split('\t')
            dev_ids.append([id1, id2])

    with open(mrpc_train_file, encoding='utf-8') as data_fh, \
            open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding='utf-8') as train_fh, \
            open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding='utf-8') as dev_fh:
        header = data_fh.readline()
        train_fh.write(header)
        dev_fh.write(header)
        for row in data_fh:
            label, id1, id2, s1, s2 = row.strip().split('\t')
            if [id1, id2] in dev_ids:
                dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
            else:
                train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
    print("\tCompleted!")


def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', help='directory to save data to', type=str, default='glue_data')
    parser.add_argument('--path_to_mrpc',
                        help='path to directory containing extracted MRPC data, '
                             'msr_paraphrase_train.txt and msr_paraphrase_test.txt',
                        type=str, default='')
    parser.add_argument('--path_to_dev_tsv', help='path to the glue_mrpc_dev.tsv file',
                        type=str, default='glue_mrpc_dev.tsv')
    args = parser.parse_args(arguments)
    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    format_mrpc(args.data_dir, args.path_to_mrpc, args.path_to_dev_tsv)


if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

================================================
FILE: src/examples/tensorflow/bert_demo/glue_mrpc_dev.tsv
================================================
Quality #1 ID #2 ID #1 String #2 String
1 1355540 1355592 He said the foodservice pie business doesn 't fit the company 's long-term growth strategy . " The foodservice pie business does not fit our long-term growth strategy .
0 2029631 2029565 Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war . His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .
0 487993 487952 The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat . The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .
1 1989515 1989458 The AFL-CIO is waiting until October to decide if it will endorse a candidate . The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries . 0 1783137 1782659 No dates have been set for the civil or the criminal trial . No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty . 1 3039165 3039036 Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed . It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status . 0 1490811 1490840 While dioxin levels in the environment were up last year , they have dropped by 75 percent since the 1970s , said Caswell . The Institute said dioxin levels in the environment have fallen by as much as 76 percent since the 1970s . 1 426112 426210 This integrates with Rational PurifyPlus and allows developers to work in supported versions of Java , Visual C # and Visual Basic .NET. IBM said the Rational products were also integrated with Rational PurifyPlus , which allows developers to work in Java , Visual C # and VisualBasic .Net. 1 1439663 1439808 The top rate will go to 4.45 percent for all residents with taxable incomes above $ 500,000 . For residents with incomes above $ 500,000 , the income-tax rate will increase to 4.45 percent . 1 3147370 3147525 The results appear in the January issue of Cancer , an American Cancer Society journal , being published online today . The results appear in the January issue of Cancer , an American Cancer Society ( news - web sites ) journal , being published online Monday . 1 3300040 3299992 The delegates said raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers . Bin Laden ’ s men pointed out that raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers . 0 524136 524119 " Sanitation is poor ... there could be typhoid and cholera , " he said . " Sanitation is poor , drinking water is generally left behind . . . there could be typhoid and cholera . " 0 969512 969295 The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 . The technology-laced Nasdaq Composite Index was down 25.36 points , or 1.53 percent , at 1,628.26 . 1 1685339 1685429 The only announced Republican to replace Davis is Rep. Darrell Issa of Vista , who has spent $ 1.71 million of his own money to force a recall . So far the only declared major party candidate is Rep. Darrell Issa , a Republican who has spent $ 1.5 million of his own money to fund the recall . 1 1967578 1967664 The decision to issue new guidance has been prompted by intelligence passed to Britain by the FBI in a secret briefing in late July . Scotland Yard 's decision to issue new guidance has been prompted by new intelligence passed to Britain by the FBI in late July . 1 2047034 2046820 Unable to find a home for him , a judge told mental health authorities they needed to find supervised housing and treatment for DeVries somewhere in California . The judge had told the state Department of Mental Health to find supervised housing and treatment for DeVries somewhere in California . 
1 2046630 2046644 The decision came a year after Whipple ended federal oversight of the district 's racial balance , facilities , budget , and busing . The decision came a year after Whipple ended federal oversight of school busing as well as the district 's racial balance , facilities and budget . 0 2221603 2221633 In midafternoon trading , the Nasdaq composite index was up 8.34 , or 0.5 percent , to 1,790.47 . The Nasdaq Composite Index .IXIC dipped 8.59 points , or 0.48 percent , to 1,773.54 . 1 129995 129864 Morgan Stanley raised its rating on the beverage maker to " overweight " from " equal-weight " saying in part that pricing power with its bottlers should improve in 2004 . Morgan Stanley raised its rating on the company to " overweight " from " equal-weight , " saying the beverage maker 's pricing power with bottlers should improve in 2004 . 0 919683 919782 The pound also made progress against the dollar , reached fresh three-year highs at $ 1.6789 . The British pound flexed its muscle against the dollar , last up 1 percent at $ 1.6672 . 0 970740 971209 Friday , Stanford ( 47-15 ) blanked the Gamecocks 8-0 . Stanford ( 46-15 ) has a team full of such players this season . 1 2745055 2745022 Last month Intel raised its revenue guidance for the quarter to between $ 7.6 billion and $ 7.8 billion . At the end of the second quarter , Intel initially predicted sales of between $ 6.9 billion and $ 7.5 billion . 0 2199097 2199072 The driver , Eugene Rogers , helped to remove children from the bus , Wood said . At the accident scene , the driver was " covered in blood " but helped to remove children , Wood said . 1 1609290 1609098 ONG KONG , July 9 Tens of thousands of demonstrators gathered tonight before the legislature building here to call for free elections and the resignation of Hong Kong 's leader . Tens of thousands of demonstrators gathered yesterday evening to stand before this city 's legislature building and call for free elections and the resignation of Hong Kong 's leader . 1 1597193 1597119 Saddam loyalists have been blamed for sabotaging the nation 's infrastructure , as well as frequent attacks on U.S. soldiers . Hussein loyalists have been blamed for sabotaging the nation 's infrastructure and attacking US soldiers . 1 2758944 2758975 Its closest living relatives are a family frogs called sooglossidae that are found only in the Seychelles in the Indian Ocean . Its closest relative is found in the Seychelles Archipelago , near Madagascar in the Indian Ocean . 0 2584416 2584653 Cooley said he expects Muhammad will similarly be called as a witness at a pretrial hearing for Malvo . Lee Boyd Malvo will be called as a witness Wednesday in a pretrial hearing for fellow sniper suspect John Allen Muhammad . 1 86007 86373 " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . " " Instead of pursuing the most imminent and real threats - international terrorists - this Bush administration has chosen to settle old scores , " Graham said . 1 1602860 1602844 He said they lied on a sworn affidavit that requires them to list prior marriages . Morgenthau said the women , all U.S. citizens , lied on a sworn affidavit that requires them to list prior marriages . 1 1201306 1201329 The association said 28.2 million DVDs were rented in the week that ended June 15 , compared with 27.3 million VHS cassettes . 
The Video Software Dealers Association said 28.2 million DVDs were rented out last week , compared to 27.3 million VHS cassettes . 0 461779 461815 With these assets , Funny Cide has a solid chance to become the first Triple Crown winner since Affirmed in 1978 . Funny Cide is looking to become horse racing 's first Triple Crown winner in a generation . 1 1438666 1438643 Intel was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel , " spokesman Chuck Mulloy said . Intel spokesman Chuck Mulloy said the company was disappointed and assessing its " options in the event Mr. Hamidi resumes his spamming activity against Intel . " 1 3261484 3261306 Mr Annan also warned the US should not use the war on terror as an excuse to suppress " long-cherished freedoms " . Annan warned that the dangers of extremism after September 11 should not be used as an excuse to suppress " long-cherished " freedoms . 1 1277539 1277527 At community colleges , tuition will jump to $ 2,800 from $ 2,500 . Community college students will see their tuition rise by $ 300 to $ 2,800 or 12 percent . 1 3035788 3035918 He made a point of saying during Tuesdays debate that the Confederate flag was a racist symbol . Though Dean made a point of saying during the debate that the Confederate flag is a racist symbol . 0 132553 132725 Bush wanted " to see an aircraft landing the same way that the pilots saw an aircraft landing , " White House press secretary Ari Fleischer said yesterday . On Tuesday , before Byrd 's speech , Fleischer said Bush wanted ' ' to see an aircraft landing the same way that the pilots saw an aircraft landing . 0 2259788 2259747 On Monday the Palestinian Prime Minister , Mahmoud Abbas , will report to the Palestinian parliament on his Government 's achievements in its first 100 days in office . Palestinian Prime Minister Mahmoud Abbas must defend the record of his first 100 days in office before Parliament today as the death toll in the occupied territories continues to rise . 0 2307064 2307235 The civilian unemployment rate improved marginally last month -- slipping to 6.1 percent -- even as companies slashed payrolls by 93,000 . The civilian unemployment rate improved marginally last month _ sliding down to 6.1 percent _ as companies slashed payrolls by 93,000 amid continuing mixed signals about the nation 's economic health . 1 3046488 3046824 Per-user pricing is $ 29 for Workplace Messaging , $ 89 for Team Collaboration and $ 35 for Collaborative Learning . Workplace Messaging is $ 29 , Workplace Team Collaboration is $ 89 , and Collaborative Learning is $ 35 . 1 86020 86007 " Instead of pursuing the most imminent and real threats – international terrorism – this Bush administration chose to settle old scores , " Mr. Graham said . " Instead of pursuing the most imminent and real threats - international terrorists , " Graham said , " this Bush administration chose to settle old scores . " 0 1100998 1100441 SARS has killed about 800 people and affected more than 8400 since being detected in China in November . SARS has killed about 800 people and sickened more than 8,400 worldwide , mostly in Asia . 1 2268396 2268480 Authorities had no evidence to suggest the two incidents were connected . There was no immediate evidence that the two incidents were connected , police said . 0 1984039 1983986 " Jeremy 's a good guy , " Barber said , adding : " Jeremy is living the dream life of the New York athlete . 
He also said Shockey is " living the dream life of a New York athlete . 0 2697659 2697747 Ratliff 's daughters , Margaret and Martha Ratliff , were adopted by Peterson after their mother 's death . Peterson helped raise Ratliff 's two daughters , Margaret and Martha Ratliff , who supported him throughout the trial . 0 2175939 2176090 After losing as much as 84.56 earlier , the Dow Jones industrial average closed up 22.81 , or 0.2 percent , at 9,340.45 . In midday trading , the Dow Jones industrial average lost 68.84 , or 0.7 percent , to 9,248.80 . 1 886618 886456 Rumsfeld , who has been feuding for two years with Army leadership , passed over nine active-duty four-star generals . Rumsfeld has been feuding for a long time with Army leadership , and he passed over nine active-duty four-star generals . 1 588637 588864 Consumers who said jobs are difficult to find jumped from 29.4 to 32.6 , while those claiming work was plentiful slipped from 13 to 12.6 . Consumers who said jobs are difficult to find jumped to 32.6 from 29.4 , while those saying work was plentiful slipped to 12.6 from 13 in April . 0 2252795 2252970 He has no immediate plans for television advertising , believing it is unnecessary this early . A Lieberman aide said there were no immediate plans for television advertising . 1 1756329 1756394 " I think it happened very quickly , " Houston Police Department homicide investigator Phil Yochum said of the crime . " I think it happened very quickly , " said Investigator Phil Yochum of the Houston Police Department 's homicide division . 1 1673112 1673068 United issued a statement saying it will " work professionally and cooperatively with all its unions . " Senior vice president Sara Fields said the airline " will work professionally and cooperatively with all our unions . " 1 2357324 2357271 " But they never climb out of the pot of beer again . " It 's just that they never climb out of the beer again . " 1 780408 780363 Chief financial officer Andy Bryant has said that hike had a greater affect volume than officials expected . Bryant has said that hike had a greater effect on demand than officials expected . 1 821523 821385 Robert Liscouski , the Assistant Secretary of Homeland Security for Infrastructure Protection , will oversee NCSD . NCSD 's chief will be Robert Liscouski , the assistant secretary of Homeland Security for Infrastructure Protection . 1 2304696 2304863 HP 's shipments increased 48 percent year-over-year , compared to an increase of 31 percent for Dell . HPs shipments increased 48 per cent year-on-year , compared to an increase of 31 per cent for Dell . 1 2531749 2531607 Chirac , who can pardon a law-breaker , refused Humbert 's request last year but kept in close touch with the family . Chirac , who has the authority to pardon law-breakers , refused Humbert 's request to be allowed to die last year but kept in close touch with the family . 1 3180014 3179967 The charges allege that he was part of the conspiracy to kill and kidnap persons in a foreign country . The government now charges that Sattar conspired with Rahman to kill and kidnap individuals in foreign countries . 1 726966 726945 In the 2002 study , the margin of error ranged from 1.8 to 4.4 percentage points . It has a margin of error of plus or minus three to four percentage points . 1 2638861 2638982 Mr. Clinton 's national security adviser , Sandy Berger , said that the White House wasn 't informed of the FBI activities . 
Clinton ’ s national security adviser , Sandy Berger , said in an interview that the White House was not informed of the FBI activities . 1 2495223 2495307 " This decision is clearly incorrect , " FTC Chairman Timothy Muris said in a written statement . The decision is " clearly incorrect , " FTC Chairman Tim Muris said . 1 55187 54831 Prosecutors allege that Nichols and co-conspirator Timothy McVeigh worked together to prepare a bomb that destroyed the Alfred P. Murrah Federal Building . Prosecutors allege that Nichols and coconspirator Timothy McVeigh worked together to prepare a 4,000-pound fuel-and-fertilizer bomb that destroyed the Murrah building . 0 2763381 2763517 Terri Schiavo , 39 , is expected to die sometime in the next two weeks in the Tampa-area hospice where she has spent the past several years . Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler . 1 1990975 1991132 Secretary of State Colin Powell designated the Chechen leader believed responsible for last year 's hostage standoff in a Moscow theater as a threat to U.S. security Friday . U.S. Secretary of State Colin Powell on Friday designated Chechen rebel leader Shamil Basayev a threat to the security of the United States and to U.S. citizens . 1 2204353 2204418 " Today , we are trying to convey this problem to Russian President Vladimir Putin and US President George W Bush . " " Today , we are trying to convey this problem to Russian President Vladimir Putin ( news - web sites ) and President Bush ( news - web sites ) . " 1 60122 60445 That would be a potential setback to Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries . The inquiry may hinder Chief Executive Phil Condit 's strategy of bolstering defense-related sales during a slump in jetliner deliveries . 1 961836 962243 PeopleSoft also said its board had officially rejected Oracle 's offer . Thursday morning , PeopleSoft 's board rejected the Oracle takeover offer . 0 3140260 3140288 The Dow Jones industrial average ended the day down 10.89 at 9,837.94 , after advancing 111.04 Wednesday . The Dow Jones industrial average fell 10.89 points , or 0.11 percent , to 9,837.94 . 1 1720166 1720115 Cortisol levels in the saliva of day care children were highest and rose most steeply in those judged by day care center personnel to be the shyest . Cortisol levels in the saliva of day-care children were highest and rose most steeply in those whom day-care centre staffed judged to be the shyest . 1 2573262 2573319 " The idea that Tony Abbott is in some way a one-dimensional political head-kicker couldn 't be more wrong , " Mr Howard said . " The idea that Tony Abbott is in some way a one-dimensional political head kicker couldn 't be more wrong . " 0 1353356 1353174 " Biotech products , if anything , may be safer than conventional products because of all the testing , " Fraley said , adding that 18 countries have adopted biotechnology . " Biotech products , if anything , may be safer than conventional products because of all the testing , " said Robert Fraley , Monsanto 's executive vice president . 1 2738677 2738741 The rate of skin cancer has tripled since the 1950s in Norway and Sweden , according to the study . The study also found that skin cancer nearly tripled in Norway and Sweden since the 1950s . 
1 1638813 1639087 We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . 1 1605350 1605425 Trans fat makes up only 1 percent to 3 percent of the total fat Americans consume , compared with 14 percent for saturated fat . Trans fat accounts for 2.5 percent of Americans ' daily calories , compared to 11 percent to 12 percent for saturated fat . 1 2494149 2494073 However , a recent slide in prices and OPEC 's expectations of a surge in oil inventories have compounded its fears about a further softening of the market . A 14 percent slide in crude prices this month and expectations of a build up in oil inventories compounded OPEC 's fears of a further softening of the market . 1 3023029 3023229 Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son . Peterson , 31 , is charged with two counts of first-degree murder in the slayings of his wife , Laci , and their unborn son , Conner . 1 1351550 1351155 Carlson on Tuesday said he would not recuse himself from the case . Service officials said Carlson refused to recuse himself from the case . 1 981185 981234 The program will grow to include ports in Dubai , Turkey and Malaysia , among others . The program will be expanded to include areas of the Middle East such as Dubai , Turkey and Malaysia , Mr. Ridge said . 0 2111629 2111786 McCabe said he was considered a witness , not a suspect . " He is not considered a suspect , " McCabe said . 1 655498 655391 The woman was exposed to the SARS virus while in the hospital but was not a health care worker , said Dr. Colin D ’ Cunha , Ontario ’ s commissioner of public health . The woman was exposed to the SARS virus while in the hospital but was not a health-care worker , said Dr Colin D 'Cunha , Ontario 's commissioner of public health . 1 533823 533909 He added that those " are not solely American principles , nor are they exclusively Western . " " These are not solely American principles nor are they exclusively Western , " Rumsfeld said . 1 581592 581570 " If we don 't march into Tehran , I think we will be in pretty good shape , " he said . " As long as we don 't march on Tehran , I think we are going to be in pretty good shape , " he said . 0 1010655 1010430 On Saturday , a 149mph serve against Agassi equalled Rusedski 's world record . On Saturday , Roddick equalled the world record with a 149 m.p.h. serve in beating Andre Agassi . 1 2241925 2242066 Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new technologies and methods to communicate more quickly and efficiently . Chad Kolton , emergency management spokesman with the Department of Homeland Security , said the government is open to new ways to communicate . 1 2796978 2797024 " APEC leaders are painfully aware that security and prosperity are inseparable , " Thai Prime Minister Thaksin Shinawatra told business leaders . " APEC leaders are painfully aware that security and prosperity are inseparable , " Thaksin said . 0 101746 101775 Danbury prosecutor Warren Murray could not be reached for comment Monday . Prosecutors could not be reached for comment after the legal papers were obtained late Monday afternoon . 
1 327839 327748 Wittig resigned last year after being indicted on federal bank fraud charges involving a real estate loan unrelated to Westar business . Wittig resigned in late November about two weeks after being indicted on bank fraud charges in a real estate case unrelated to the company . 0 2988297 2988555 Shattered Glass , " starring Hayden Christensen as Stephen Glass , debuted well with $ 80,000 in eight theaters . " Shattered Glass " _ starring Hayden Christensen as Stephen Glass , The New Republic journalist fired for fabricating stories _ debuted well with $ 80,000 in eight theaters . 1 2217613 2217659 He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston . He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife . 0 2128530 2128455 However , EPA officials would not confirm the 20 percent figure . Only in the past few weeks have officials settled on the 20 percent figure . 1 2208376 2208198 University of Michigan President Mary Sue Coleman said in a statement on the university 's Web site , " Our fundamental values haven 't changed . " Our fundamental values haven 't changed , " Mary Sue Coleman , president of the university , said in a statement in Ann Arbor . 1 1980654 1980641 The first products are likely to be dongles costing between US $ 100 and US $ 150 that will establish connections between consumer electronics devices and PCs . The first products will likely be dongles costing $ 100 to $ 150 that will establish connections between consumer electronics devices and PCs . 0 589579 589557 However , Lapidus expects foreign brands ' sales to be up 4 percent , driven by strong truck sales at Honda Motor Co . Lapidus expects Ford to be down 5 percent , Chrysler down 10 percent and foreign brands up 4 percent driven by strong truck sales at Honda . 1 1636060 1635946 Michel , who remains in the government , denied that US pressure had provoked the government 's move . Michel , who has stayed in the new government , denied that it was U.S. pressure which had provoked the government 's move . 1 1630585 1630657 Some of the computers also are used to send spam e-mail messages to drum up traffic to the sites . Some are also used to send spam e-mail messages to boost traffic to the sites . 0 447728 447699 Indonesia 's army has often been accused of human rights abuses during GAM 's battle for independence , charges it has generally denied while accusing the separatists of committing rights violations . Indonesia 's army has been accused of human rights abuses during its earlier battles with GAM , charges it has generally denied . 1 1606495 1606619 Bush also hoped to polish his anti-AIDS credentials in Uganda , which has been hailed as an African pioneer in fighting the killer disease . President Bush flies to Uganda Friday hoping to polish his anti- AIDS credentials in a country hailed as an African pioneer in fighting the epidemic . 1 1550897 1550977 Later this year , the command will send trainers with soldiers from four North African nations on patrolling and intelligence gathering missions . This fall the command will send trainers to work with soldiers from four North African nations on patrolling and gathering intelligence . 0 490376 490490 The reports helped overcome investor jitters after the euro briefly hit an all-time high against the dollar Tuesday . Stocks slipped at the open after the euro hit record highs against the dollar . 
1 3084554 3084612 Sales for the quarter beat expectations , rising 37 percent year-on-year to 1.76 billion euros . Sales rose 37 per cent year-on-year to 1.76bn , beating expectations .
1 315647 315778 If the MTA 's appeal to a higher court is successful , the $ 2 bus and subway base fare won 't be rolled back . If the MTA 's appeal is successful , the $ 2 bus and subway base fare won 't change .
1 3428298 3428362 Robert Walsh , 40 , remained in critical but stable condition Friday at Staten Island University Hospital 's north campus . Walsh , also 40 , was in critical but stable condition at Staten Island University Hospital last night .
1 2523564 2523358 The Guru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS ( Basic Input Output System ) update and a troubleshooting-assistance feature called Black Box . The µGuru microcontroller serves four functions : hardware monitoring , overclocking management , BIOS update and a troubleshooting-assistance feature called Black Box .
1 2079200 2079131 U.S. corporate bond yield spreads tightened in spotty trading on Friday as Wall Street labored to get back on its feet after the largest power outage ever in North America . U.S. stocks rose slightly on feather-light volume on Friday , as Wall Street regrouped after the biggest-ever power outage in North America .
1 818091 817811 The company said it would issue revised guidance for the full fiscal year next month when it releases its Q2 results . The company said it would renew its guidance for 2003 when it announces its second quarter results in mid-July .
1 1580638 1580663 " I stand 100 percent by it , and I think our intelligence services gave us the correct information at the time . " I stand 100 percent by it , and I think that our intelligence services gave us the correct intelligence and information at the time , " Blair said .
0 1919740 1919926 " I don 't know if the person I 'm talking to now may end up being someone else at another time that may not follow the rules , " Parrish said . " I don 't know whether the person I 'm talking to now may end up being someone else , " Parrish said .
1 2748287 2748550 " I think it 's going to be a close vote , but I think the grant proposal is going to win , " McConnell said . " I think it 's going to be a close vote , but I think the grant proposal 's going to win , " said Sen. Mitch McConnell , assistant majority leader .
1 3394891 3394775 Twenty-eight people were believed to have been spending Christmas Day with the caretaker of the St Sophia 's camp , when the mudslide smashed into two cabins . Twenty-seven people were believed to have been spending Christmas Day with the caretaker of Saint Sophia Camp , a Greek Orthodox facility , when the mudslide roared through .
0 2963943 2963880 One , Capt. Doug McDonald , remained hospitalized in critical condition on Thursday . Her 20-year-old sister , Allyson , was severely burned and remained hospitalized in critical condition .
0 1865364 1865251 The United States finally relented during President Bush 's visit to Africa earlier this month . During President Bush 's trip to Africa earlier this month , however , Washington said it would support the increase .
1 263690 263819 " There is no conscious policy of the United States , I can assure you of this , to move the dollar at all , " he said . He also said there is no conscious policy by the United States to move the value of the dollar .
1 283751 283290 It 's the first such drill since the September 11 terrorist attacks on New York and Washington . It is the nation 's first large-scale counterterrorism exercise since the Sept . 11 terrorist attacks .
1 2517014 2516995 Myanmar 's pro-democracy leader Aung San Suu Kyi will return home late Friday but will remain in detention after recovering from surgery at a Yangon hospital , her personal physician said . Myanmar 's pro-democracy leader Aung San Suu Kyi will be kept under house arrest following her release from a hospital where she underwent surgery , her personal physician said Friday .
1 1330643 1330622 According to the Merchant Marine Ministry , the 37-year-old ship is registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands . The Baltic Sky is a 37-year-old ship registered to Alpha Shipping Inc. based in the Pacific Ocean nation of Marshall Islands .
1 3111452 3111428 In an unusual move , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages that critics contend could disrupt millions of Web sites . In an unusual move that critics contend could disrupt millions of Web sites , the U.S. Patent and Trademark Office is reconsidering a patent affecting Internet pages .
0 1167835 1167651 Kansas Department of Health and Environment records show there were 88 abortions performed on girls age 14 and younger last year . Statistics from the Kansas Department of Health and Environment show that 11,844 abortions were performed in the state last year .
0 1423836 1423708 A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter . Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter .
1 2090911 2091154 Waiting crowds filling the streets on both sides overwhelmed the peacekeepers soon after daylight , sweeping past the barbed wire barricades . But waiting crowds filling the streets rushed the bridges soon after daylight , overrunning razor-wire barricades .
1 2265271 2265152 Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products not sold in the United States . Barry Callebaut will be able to use Brach 's retail network to sell products made from its German subsidiary Stollwerck , which makes chocolate products unknown to the American market .
1 3062202 3062308 By skirting the FDA 's oversight , Eagan said , the quality of the imported drugs is " less predictable " than for those obtained in the United States . By skirting the FDA 's oversight , Eagan said the quality of the imported drugs is " less predictable " than U.S. drugs .
1 2155514 2155377 He said : " For the first time there is an easy and affordable way of making this treasure trove of BBC content available to all . " " For the first time , there is an easy and affordable way of making this treasure trove of BBC content available to all , " Dyke said .
1 1552068 1551928 Three such vigilante-style attacks forced the hacker organizer , who identified himself only as " Eleonora [ 67 ] , " to extend the contest until 7 p.m. EST Sunday . Three such vigilante-style attacks forced the hacker organiser , who identified himself only as " Eleonora67 ] , " to extend the contest until 8am ( AEST ) today .
1 936978 937500 Eric Gagne pitched a perfect ninth for his 23rd save in as many opportunities . Gagne struck out two in a perfect ninth inning for his 23rd save .
0 985015 984975 One way or another , Harry Potter And The Order Of The Phoenix will be in your hands by Saturday . Just about everything about " Harry Potter and the Order of the Phoenix " will set records .
1 1430357 1430425 " Allison just proves you don 't need to wait until August or September to have a disaster , " said Josh Lichter , a meteorologist with the Houston-Galveston weather office . " Allison just proves you don 't need to wait until August or September to have a disaster , " Lichter said .
1 3039310 3039413 Today , analysts say , UN members can no longer ignore the shifts since the September 11 2001 attacks . On Wednesday , analysts say , UN members can no longer ignore the shifts since the attacks in the US of September 11 2001 .
1 34513 34742 Police say CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the United States . Mr McKinlay said that CIBA was involved in the importation of qat , a narcotic substance legal in Britain but banned in the US .
1 368067 368018 Chiron already has nearly 20 percent acceptances from PowderJect 's shareholders . Chiron has acceptances from holders of nearly 20 percent of PowderJect shares .
0 611663 611716 Ernst & Young has denied any wrongdoing and plans to fight the allegations . Ernst & Young has denied the SEC 's claims , and called its recommendations " irresponsible " .
1 98432 98657 The attack followed several days of disturbances in the city where American soldiers exchanged fire with an unknown number of attackers as civilians carried out demonstrations against the American presence . The attack came after several days of disturbance in the city in which U.S. soldiers exchanged fire with an unknown number of attackers as civilians protested the American presence .
1 3039007 3038845 No company employee has received an individual target letter at this time . She said no company official had received " an individual target letter at this time . "
1 1708040 1708062 Second-quarter results reflected a gain of 10 cents per diluted share , while the 2002 results included a loss of 19 cents per diluted share . The second-quarter results had a non-operating gain of 10 cents a share while the 2002 second-quarter performance had a net non-operating loss of 19 cents a share .
0 1757264 1757375 He allegedly told his ex-wife in an angry phone call that he had no intention of following their new custody agreement . The two had battled over custody and he allegedly told her in an angry phone call that he had no intention of following their new custody agreement .
1 383417 383558 Worldwide , more than 50 million people have seen " Les Miz , " with gross receipts of $ 1.8 billion . Worldwide , Les Misérables has been seen by over 50 million people , with a total gross of over $ 2 billion .
0 2766112 2766084 In fiction : Edward P. Jones ( " The Known World " ) and Scott Spencer ( " A Ship Made of Paper " ) . The fifth nominee for fiction is Scott Spencer , for A Ship Made of Paper .
1 1261116 1261234 " Overwhelmingly the Windows brand really resonated with them . " " Windows was the part of the experience that really resonated with people . "
1 3028143 3028234 The Centers for Medicare and Medicaid Services , the federal agency that runs Medicare , last year began a similar effort for nursing homes . The Centers for Medicare and Medicaid launched a similar consumer tool for nursing homes last year .
0 249699 249623 Vivace was founded in 1999 and has raised over $ 118 million in three rounds of venture financing . During difficult times for technology venture capital , Vivace raised over $ 118 million in three rounds of venture financing .
0 3448488 3448449 The Dow Jones industrial average < .DJI > added 28 points , or 0.27 percent , at 10,557 , hitting its highest level in 21 months . The Dow Jones industrial average < .DJI > rose 49 points , or 0.47 percent , to 10,578 .
1 2749322 2749663 The Democratic candidates also began announcing their fund-raising totals before Wednesday 's deadline to file quarterly reports with the Federal Election Commission . The Democratic candidates also began announcing their fund-raising totals in advance of the deadline today to file quarterly reports with the Federal Election Commission .
0 2204592 2204588 Sun Microsystems Inc. on Thursday said it had added 100 new third-party systems and 100 new components to its Hardware Compatibility List for the Solaris x86 operating system Platform Edition . The vendor has added 100 new third-party systems and 100 new components to the operating system 's Hardware Compatibility List ( HCL ) .
1 2889005 2888954 Prosecutors said PW Marketing violated the state 's 1998 anti-spam law by sending unsolicited e-mail without a toll-free number for recipients to call to stop additional mailings . Prosecutors said PW Marketing violated the 1998 anti-spam law because these unsolicited e-mails were sent without a free call number for recipients to phone to stop additional mailings .
0 1657632 1657619 The Neighbours star and singer spent yesterday resting at her family home in Sydney and will have more tests today . Goodrem spent yesterday resting in her family home in Sydney and will have more tests today to determine her exact treatment .
0 555617 555528 The 3 rd Armored Cavalry Regiment is 5,200 strong and the largest combat unit at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment .
1 2396937 2396818 " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the Fed said in a statement accompanying the unanimous decision . " The risk of inflation becoming undesirably low remains the predominant concern for the foreseeable future , " the policy-setting Federal Open Market Committee said .
0 2339738 2339771 " It is bad for Symbian , " said Per Lindberg , analyst at Dresdner Kleinwort Wasserstein . " Motorola has displayed clear disloyalty " to Symbian , said Per Lindberg , an analyst at Dresdner Kleinwort Wasserstein in London .
0 1616174 1616206 Bob Richter , a spokesman for House Speaker Tom Craddick , had no comment about the ruling . Bob Richter , spokesman for Craddick , R-Midland , said the speaker had not seen the ruling and could not comment .
1 635783 635802 But Ms Ward said the headroom under its financial covenants was " tight " and that there could be another downgrade if Southcorp breached any of its banking covenants . But Ms Ward said the headroom under its financial covenants was " tight " and that there could be a rating downgrade if Southcorp did breach any banking covenants .
1 3444633 3444733 He added : ``I 've never heard of more reprehensiblebehaviour by a doctor . The Harrisons ’ lawyer Paul LiCalsi said : “ I ’ ve never heard of more reprehensible behaviour by a doctor .
1 555553 555528 Broomhead was assigned to 2nd Squadron , 3rd Armor Cavalry Regiment , based at Fort Carson . Broomhead , 34 , was assigned to the 2nd Squadron , 3rd Armored Cavalry Regiment .
1 1112021 1111925 Other staff members , however , defended the document , saying it would still help policy-makers and the agency improve efforts to address the climate issue . Some E.P.A. staff members defended the document , saying that although pared down it would still help policy makers and the agency address the climate issue .
0 2749410 2749625 President Bush raised a record-breaking $ 49.5 million for his re-election campaign over the last three months , with contributions from 262,000 Americans , the president 's campaign chairman said Tuesday . President Bush has raised $ 83.9 million since beginning his re-election campaign in May , and has $ 70 million of that left to spend , his campaign said Tuesday .
1 1629064 1629043 An episode is declared when the ozone reaches .20 parts per million parts of air for one hour . A Stage 1 episode is declared when ozone levels reach 0.20 parts per million .
1 789691 789665 " He may not have been there , " the defence official said on Thursday . " He may not have been there , " said a defence official speaking on condition of anonymity .
1 844421 844679 The U.N. troops are in Congo to protect U.N. installations and personnel , and they can only fire in self defense and have been unable to stem the violence . The troops - whose mandate is to protect U.N. installations and personnel - can only fire in self-defense and have been unable to stem the violence .
1 58540 58567 North American markets grabbed early gains Monday morning , as earnings season begins to slow and economic indicators take the spotlight . North American futures pointed to a strong start to the first trading session of the week Monday , as earnings season slows and economic indicators take the spotlight .
1 781439 781461 Xerox itself paid a $ 10 million fine last year to settle similar SEC charges . Xerox itself previously paid a $ 10-million penalty to settle the SEC accusations .
1 1909579 1909408 " This deal makes sense for both companies , " said National Chief Executive Brian Halla . " This deal makes sense for both companies , " Halla said in a prepared statement .
0 787432 787464 The blasts killed two people and injured more than 150 others . The Atlanta Olympic Games attack killed one woman and injured more than 100 other people .
0 52758 52343 Morrill 's wife , Ellie , sobbed and hugged Bondeson 's sister-in-law during the service . At the service Morrill 's widow , Ellie , sobbed and hugged Bondeson 's sister-in-law as people consoled her .
1 1675025 1675047 Spansion products are to be available from both AMD and Fujitsu , AMD said . Spansion Flash memory solutions are available worldwide from AMD and Fujitsu .
1 2131318 2131372 About 1,500 police will be deployed for the visit . Around 1,500 police are to be deployed at Niigata for the ferry 's visit .
1 325763 325928 Gamarekian told The News she remembers only the woman 's first name - and refused to reveal it . She told the New York Daily News she remembers only the intern 's first name , which she refused to reveal .
1 2638975 2638855 One of the FBI ’ s key operatives , who had a falling out with the bureau , provided an account of the operation at a friend ’ s closed immigration court proceeding . One of the FBI 's key operatives , who has had a falling-out with the bureau , provided an account of the operation at a friend 's closed immigration court proceeding .
1 2198694 2198937 A nationally board certified teacher with a master 's degree , Kelley makes a salary of $ 65,000 in his 30th year . A nationally board certified teacher with a master 's degree , Kelley , in his 30th year teaching , makes $ 65,000 .
1 1825432 1825301 A man arrested for allegedly threatening to shoot and kill a city councilman from Queens was ordered held on $ 100,000 bail during an early morning court appearance Saturday . The Queens man arrested for allegedly threatening to shoot City Councilman Hiram Monserrate was held on $ 100,000 bail Saturday , a spokesman for the Queens district attorney said .
1 2906104 2906322 They were being held Sunday in the Camden County Jail on $ 100,000 bail . They remained in Camden County Jail on Sunday on $ 100,000 bail .
1 722278 722383 Ms Stewart , the chief executive , was not expected to attend . Ms Stewart , 61 , its chief executive officer and chairwoman , did not attend .
0 101747 101777 Christina 's aunt , Shelley Riling , said the defense 's claims were preposterous . Christina 's aunt , Shelley Riling , said she will address the court .
1 2224884 2224819 The Justice Department Aug. 19 gave pre-clearance for the Oct. 7 date for the election to recall Gov. Gray Davis , saying it would not affect minority voting rights . The Justice Department on Aug. 19 sanctioned the Oct. 7 date for recall election , saying it would not affect voting rights .
0 977938 978162 Lord Falconer hailed the changes as " a new beginning as far as the courts , Crown Prosecution Service and police are concerned " . " It 's a new beginning as far as the courts , Crown Prosecution Service and police are concerned , making the criminal justice system work better . "
0 1015010 1014963 GE stock closed at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange .
1 1513190 1513246 At least 27 US troops have been killed in hostile fire since Bush 's statement . At least 26 American troops have been killed in hostile fire since major combat was officially declared over on May 1 .
1 2385348 2385394 A recent poll showed Edwards with a narrow lead in South Carolina , and he plans a rally there later on Tuesday . A recent poll showed Edwards in a virtual four-way tie at the top in South Carolina , and he plans a rally there later on Tuesday .
1 2317018 2317252 November 17 's last victim was British defence attache Stephen Saunders , who was shot on an Athens road in June 2000 . November 17 's last victim was British defense attache Stephen Saunders , who was shot and killed at point-blank range on a busy Athens road in June 2000 .
0 1831696 1831660 The agency charged that one WD Energy worker discussed false reporting with traders at two other energy companies . The agency found further that a WD Energy employee discussed false reporting with traders at two other energy companies , which the CFTC didn 't identify .
1 1528383 1528083 Zulifquar Ali , a worshipper slightly wounded by shrapnel , said the assailants first targeted the mosque 's security guards . Witness Zulfiqar Ali , who was slightly wounded by shrapnel , said the attackers had focused on the mosque 's guards .
1 917965 918315 For the second year in a row , rises in hospital costs accounted for much of the inflation , accounting for 51 percent of the overall cost increase . For the second year in a row , rises in hospital costs dominated the increase , accounting for 51 percent of the overall cost spiral .
0 3218713 3218830 Q : Can I buy coverage for prescription drugs right away ? Congress has added a new benefit - an option to buy insurance coverage for prescription drugs .
1 221079 221003 The airline also said it has the option to buy 380 more airplanes , orders that would be split evenly between the two manufacturers . The airline has the option to buy 380 more , split evenly between the two manufacturers .
1 2546175 2546198 Dr Mark McClean , Jonathan 's family doctor , said if the drug had been administered earlier Jonathan would have retained more of his brain functions . Dr Mark McClean , the family 's GP , said had the drug been administered to Jonathan earlier , he would have retained more of his brain function .
0 799346 799268 The chain operates more than 3,400 stores , and has annual revenue of about $ 15.8 billion . The chain , which has been under new management since late 1999 , has more than 3,400 stores and $ 15.8 billion in annual revenue .
0 2673104 2673130 All patients developed some or all of the symptoms of E. coli food poisoning : bloody diarrhea , vomiting , abdominal cramping and nausea . Symptoms of the E. coli infection include bloody diarrhea , nausea , vomiting and abdominal cramping .
1 1354501 1354476 Federal regulators have turned from sour to sweet on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings Inc. and Dreyer 's Grand Ice Cream Inc . Federal regulators have changed their minds on a proposed $ 2.8 billion merger of ice cream giants Nestle Holdings and Dreyer 's Grand Ice Cream .
1 3070979 3070949 Environmental campaigners are using this weekend ’ s lunar eclipse to highlight the huge increase in light pollution across the UK . Environmental campaigners used the eclipse to highlight the surge in light pollution across Britain .
0 1264509 1264471 Available July 7 , the software supports the Solaris , IBM AIX , Red Hat Linux and Windows operating systems . The OpForce product currently works with Solaris , AIX , Red Hat Linux and Windows servers .
1 103280 103431 Justice Minister Martin Cauchon and Prime Minister Jean Chrétien have both said the Liberal government will introduce legislation soon to decriminalize possession of small amounts of pot for personal use . Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said the government will introduce legislation to decriminalize possession of small amounts of pot .
0 110731 110648 But Chauncey Billups demonstrated he 's also capable of big games , scoring 77 points over the final two games against the Magic . Billups scored 77 points in the final two games of the first-round series against the Magic .
1 2274844 2274714 Kelly killed himself after being exposed as the source for a BBC report which claimed the government had embellished evidence of Iraq 's banned weapons to justify the war . He killed himself after being exposed as the source for a BBC report which claimed the government exaggerated the case for war against Iraq .
0 1050307 1050144 And it 's going to be a wild ride , " said Allan Hoffenblum , a Republican consultant . Now the rest is just mechanical , " said Allan Hoffenblum , a Republican consultant .
1 2810634 2810670 While the Ibrahims had one separation operation , Goodrich and Dr. David Staffenberg plan about three for the Aguirres , with several weeks between each . Instead of one long operation to separate the twins , Goodrich and Dr. David Staffenberg plan about three , with several weeks between each .
1 3073773 3073779 Lay had contended that turning over the documents would violate his Fifth Amendment right against self-incrimination . Lay had refused to turn over the papers , asserting his Fifth Amendment right against self-incrimination .
0 261202 260995 The WHO experts didn 't say how many cases in Hebei were in rural areas . Hebei has reported 191 cases and eight deaths , though the WHO experts did not say how many were in rural areas .
1 1824224 1824209 Nearly 300 mutinous troops who seized a Manila shopping and apartment complex demanding the government resign gave up and retreated peacefully after some 19 hours . Mutinous troops who seized a Manila shopping and apartment complex demanding the government resign ended a 19-hour standoff late Sunday and returned to barracks without a shot fired .
1 548867 548785 In three years , Lend Lease has slipped from a top-five stock , when its share price was around $ 24 , to 37th . In the space of three years , Lend Lease has slipped from a top-five 5 stock when its share price hovered around $ 24 to 37th on the list .
0 2796658 2796682 About two hours later , his body , wrapped in a blanket , was found dumped a few blocks away . Then his body was dumped a few blocks away , found in a driveway on Argyle Road .
1 1808166 1808434 Columbia broke up over Texas upon re-entry on Feb. 1 . Columbia broke apart in the skies above Texas on Feb. 1 .
1 853475 853342 A year or two later , 259 , or 10 per cent , of the youths reported that they had started to smoke , or had taken just a few puffs . Within two years , 259 , or 10 percent , of the youths reported they had started to smoke or had at least taken a few puffs .
0 977772 977804 The Lord Chancellor was guardian of the Great Seal , used to stamp all official documents from the sovereign . Falconer will hold on , for now , to the Lord Chancellor 's Great Seal , used to sign off instructions from the sovereign .
1 577854 578500 Cindy Yeast , a 50-year-old Washington-area publicist , says she began taking supplements two years ago in part to avoid mild dementia that affects her elderly parents . She started taking supplements two years ago - partly to stave off mild dementia that affects her elderly parents .
1 2829194 2829229 The two are not related , but have referred to each other as father and son . He 's not related to Malvo , but the two have referred to each other as father and son .
1 2074182 2074668 Gibson said last month in a press statement that " neither I nor my film are anti-Semitic . Gibson said in a June statement that he and his film are not anti-Semitic .
0 2758265 2758282 The world 's largest software company said it recognized the difficulty the multiple patches posed for companies , and set out to make it easier for them to apply the updates . The world 's largest software company said it recognized the difficulty the multiple patches posed for companies trying to apply them .
1 1958079 1958143 The Dow Jones industrial average .DJI ended up 64.64 points , or 0.71 percent , at 9,191.09 , according to the latest available data . The blue-chip Dow Jones industrial average .DJI added 38 points , or 0.42 percent , to 9,165 .
1 544217 544325 The vote came just two days after Kurds swept City Council elections , taking the largest single block of votes on the 30-seat council . The vote for mayor followed City Council elections that gave Kurds the largest block of votes on the 30-seat council .
1 2385288 2385256 Large swells and dangerous surf already were being felt along sections of the coast . Already large swells and dangerous surf have arrived along the mid-Atlantic .
0 2324708 2325028 Based on a separate survey of households , the unemployment rate fell in August to 6.1 percent from 6.2 percent . Labor Department analysts discounted a slight improvement in the national unemployment rate , which fell in August to 6.1 percent from 6.2 percent .
1 2139506 2139427 " We will work with the board to ensure a smooth transition . " He said federal regulators would work with the corporation to ensure a " smooth transition . "
1 2965576 2965701 Gasps could be heard in the courtroom when the photo was displayed . Gasps could be heard as the photo was projected onto the screen .
1 2931098 2931144 Gilead had earnings of $ 73.1 million , or 33 cents a share , compared with $ 20.8 million , or 10 cents , in the year-ago quarter . Quarterly profit climbed to $ 73.1 million , or 33 cents a share , from $ 20.8 million , or 10 cents , a year earlier , the company said .
0 644788 644816 " I had one bad stretch of holes that put me out of contention to win , " Woods said . " I had one bad stretch of holes that put me out of contention , " Woods said , referring to his 42 on the front nine Saturday .
0 2551891 2551563 The poll had a margin of error of plus or minus 2 percentage points . It had a margin of sampling error of plus or minus four percentage points and was conducted Thursday through Saturday .
1 1089053 1089297 Sen. Patrick Leahy of Vermont , the committee 's senior Democrat , later said the problem is serious but called Hatch 's suggestion too drastic . Sen. Patrick Leahy , the committee 's senior Democrat , later said the problem is serious but called Hatch 's idea too drastic a remedy to be considered .
1 3435735 3435717 The broad Standard & Poor 's 500 < .SPX > eased 0.37 of a point , or 0.03 percent , at 1,121 . The Standard & Poor 's 500 Index < .SPX > slipped 0.26 point , or 0.02 percent , to 1,121.96 .
0 1954 2142 Watertown , Saugus and Framingham also are going smoke-free Monday , joining a growing number of cities around the country . Along with Boston , Watertown , Saugus and Framingham also are going smoke-free Monday .
1 3400796 3400822 That is evident from their failure , three times in a row , to get a big enough turnout to elect a president . Three times in a row , they failed to get a big _ enough turnout to elect a president .
1 1220668 1220801 We firmly believe we have an absolute right to use the common word ' spike ' as the name of our network . " We firmly believe that we have an absolute right to use the common word ' spike ' to name our network .
1 1889954 1889847 Sources who knew of the bidding said last week that cable TV company Comcast Corp. was also looking at VUE . Late last week , sources told Reuters cable TV company Comcast Corp. CMCSA.O also was looking at buying VUE assets .
1 315785 315653 But MTA officials appropriated the money to the 2003 and 2004 budgets without notifying riders or even the MTA board members considering the 50-cent hike , Hevesi found . MTA officials appropriated the surplus money to later years ' budgets without notifying riders or the MTA board members when the 50-cent hike was being considered , he said .
0 1521034 1520582 White , who had suffered kidney failure from years of high blood pressure , died at Cedars-Sinai Medical Center around 9 : 30 a.m. , said manager Ned Shankman . White , who had kidney failure from years of high blood pressure , had been undergoing dialysis and had been hospitalized since a September stroke .
1 2083598 2083810 About 10 percent of high school and 16 percent of elementary students must be proficient at math . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient .
1 1910610 1910455 The legal ruling follows three days of intense speculation Hewlett-Packard Co. may be bidding for the company . The legal ruling follows three days of wild volatility in RIM 's stock over speculation that PC giant Hewlett-Packard Co. may be bidding for the company .
1 3113791 3113782 The European Commission , the EU 's antitrust enforcer , is expected to issue its decision next spring — unless a settlement is reached . The European Commission is expected to issue its decision in the case next spring — unless a settlement is reached .
1 3214517 3214483 " So Sebastian did his best to convincingly confess to a crime that he didn 't commit in order to survive , " she told jurors . " Sebastian did his best to confess convincingly to a crime he didn 't do in order to survive , " Ms. Richardson declared .
0 2083612 2083810 Twenty percent of Latino students and 23 percent of black students performed at proficient or higher . In math , 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient .
1 661390 661218 He is charged in three bombings in Atlanta including a blast at the 1996 Olympics and one in Alabama . He is charged in three bombings in Atlanta - including a blast at the 1996 Olympics - along with the bombing in Alabama .
1 1269572 1269682 The men were remanded in custody and are due to appear again before court on July 8 . They were remanded in custody and will appear in court again on July 8 .
1 1095780 1095652 " No matter who becomes the sponsor for stock-car racing 's top series , NASCAR will need an all-star event , " Wheeler said in a statement . No matter who becomes the sponsor for stock-car racings top series , NASCAR will need an all-star event , Wheeler said Tuesday .
1 116294 116332 The Phillies were upset that Counsell had stolen second in the sixth inning with Arizona leading 7-1 . The Phillies were apparently upset when Counsell stole during the sixth with the Diamondbacks up 7-1 .
1 941617 941673 He said his hatred for such people grew from these discussions and had helped convince him violence was the answer . His hatred for these people had germinated from these discussions and helped cement his belief that violence was the panacea .
1 2640607 2640576 " There is no need for one deadline for all to create the ASEAN Economic Community , " Thaksin said . Thus , he said , there did not have to one deadline to create the economic community .
1 3310210 3310286 The announcement was made during the recording of a Christmas concert attended by top Vatican cardinals , bishops , and many elite from Italian society , witnesses said . The broadside came during the recording on Saturday night of a Christmas concert attended by top Vatican cardinals , bishops and many elite of Italian society , witnesses said .
1 3376093 3376101 The additional contribution brings total U.S. food aid to North Korea this year to 100,000 tonnes . The donation of 60,000 tons brings the total of U.S. contributions for the year to 100,000 .
1 1549586 1549609 Leon Williams ' body was found inside his third-floor apartment at 196 Bay St. , in Tompkinsville . The dead man , Leon Williams , was found in his third-floor apartment .
1 460211 460445 The player 's eyes were bloodshot and a blood-alcohol test produced a reading of 0.18 - well above Tennessee 's level of presumed intoxication of 0.10 , the report said . He failed a field sobriety test and a blood-alcohol test produced a reading of 0.18 – well above Tennessee 's level of presumed intoxication of 0.10 , the report said .
1 1196962 1197061 But Virgin wants to operate Concorde on routes to New York , Barbados and Dubai . Branson said that his preference would be to operate a fully commercial service on routes to New York , Barbados and Dubai .
0 862804 862715 He tried to fight off officers and was taken to a hospital after a police dog bit him but was later released . Cruz tried to fight off officers and was hospitalized after a police dog bit him , Sgt. Steve Dixon said .
1 1726935 1726879 The announcement , which economists said was not a surprise , may be bittersweet for the millions of Americans without jobs . Economists said the announcement was not a surprise , and politicians said it offered little comfort to the millions of Americans without jobs .
0 331980 332110 Asked if the delegates could leave on Friday , police intelligence chief in Aceh , Surya Dharma , told reporters they could not because they did not have proper permission . Asked if the delegates could leave on Friday , police intelligence chief Surya Dharma told reporters : " Of course they may not go .
1 173879 173832 Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid the yen 's rise against the dollar . Dealers said the dollar also drew some downside support as Japanese investors are expected to keep snapping up foreign bonds amid ever-falling domestic interest rates .
0 2834988 2835026 Iran has until the end of the month to satisfy the agency it has no plans for nuclear weapons . The Iranians have until the end of the month to answer all the agency 's questions about their past nuclear activities .
1 2587300 2587243 Her father , Florin Cioaba , the king of Transylvania 's Gypsies , had her brought back and she was married against her will . Her father , Roma King Florin Cioaba , had her brought back and she was promptly married against her will .
0 554905 554627 Claire had advanced to the third round of the 76th annual Scripps Howard National Spelling Bee . One by one they strolled to the microphone , all 251 youngsters in the 76th Scripps Howard National Spelling Bee .
1 1912524 1912648 Citigroup Inc . C.N , the world 's largest financial services company , on Wednesday promoted Marjorie Magner to chairman and chief executive of its global consumer group . Citigroup ( C ) on Wednesday named Marjorie Magner chairman and chief executive of its colossal global consumer business .
1 3255597 3255668 " They 've been in the stores for over six weeks , " says Carney . The quarterlies usually stay in stores for between six to eight weeks , " Carney added .
1 629316 629289 Let me just say this : the evidence that we have of weapons of mass destruction was evidence drawn up and accepted by the joint intelligence community . " The evidence that we had of weapons of mass destruction was drawn up and accepted by the Joint Intelligence Committee , " he said .
1 54181 53570 Ridge said no actual explosives or other harmful substances will be used . Ridge said no real explosives or harmful devices will be used in the exercise .
1 723557 724115 Thus far , Stewart 's company appears ready to stand behind her . For now , the company 's management appears to be standing behind Stewart .
0 2607718 2607708 But late Thursday night , the campaign issued a statement saying there would be no news conference and no big announcement . But late yesterday , the campaign and the state Democratic Party said there would be no news conference .
1 753858 753890 There 's also a flaw that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box .
1 587009 586969 Another $ 100-million in savings will come from management layoffs and pay cuts . The airline expects to save another $ 100-million a year through management layoffs and pay cuts .
1 308567 308525 He called on Prime Minister John Howard to establish a royal commission on child sex abuse . The Senate motion also called on Prime Minister John Howard to hold a royal commission into child sex abuse .
0 665419 665612 " We think that the United States of America should support the free speech of all groups , " Mr. White said , objecting to Mr. Olson 's recommendation . We think that the United States of America should support the free speech of all groups , he said .
1 2763517 2763576 Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice where she has been living for several years , said her father , Bob Schindler . The tube was removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for several years .
0 3107118 3107136 After 18 months , Nissen found that Lipitor stopped plaque buildup in the patients ' arteries . After 18 months , the atorvastatin patients had no change in the plaque in their arteries .
1 780604 780466 Toll , Australia 's second-largest transport company , last week offered NZ75 a share for Tranz Rail . Toll last week offered to buy the company for NZ75c a share , or $ NZ158 million .
0 1989213 1989116 " This child was literally neglected to death , " Armstrong County District Attorney Scott Andreassi said . Armstrong County District Attorney Scott Andreassi said the many family photos in the home did not include Kristen .
1 1462409 1462504 Wal-Mart , the nation 's largest private employer , has expanded its antidiscrimination policy to protect gay and lesbian employees , company officials said Tuesday . Wal-Mart Stores Inc . , the nation 's largest private employer , will now include gays and lesbians in its anti-discrimination policy , company officials said Wednesday .
1 260952 260924 Metro , bus and local rail services in France 's four largest towns -- Paris , Lyon , Lille and Marseille -- were severely disrupted , Europe 1 radio reported . Subway , bus and suburban rail services in France 's four largest cities -- Paris , Lyon , Lille and Marseille -- were severely disrupted , transport authorities said .
1 1224743 1225510 In the undergraduate case , Rehnquist said the use of race was not " narrowly tailored " to achieve the university 's asserted interest in diversity . Rehnquist wrote that the system was not narrowly tailored to achieve the interest in educational diversity .
0 3329379 3329416 SP2 is basically about security enhancements to Windows , such as the improved Internet Connection Firewall ( ICF ) . The firewall in the current Windows XP was known as the Internet Connection Firewall ( ICF ) .
1 2362761 2362698 A landslide in central Chungchong province derailed a Seoul-bound train and 28 passengers were injured , television said . In central Chungchong province , a landslide caused a Seoul-bound Saemaeul Express train to derail , injuring 28 people , local television said .
0 1465073 1464854 They will help draft a plan to attack obesity that Kraft will implement over three to four years . The team will help draft a plan by the end of the year to attack obesity .
1 195728 196099 But that amount would probably be impossible to pass in the Senate , where Republican moderates have refused to go above $ 350 billion . Such an amount would probably be unable to summon a majority of the Senate , where Republican moderates have refused to go above $ 350 billion .
1 2587767 2587673 In the clash with police , Lt. Mothana Ali said about 1,000 demonstrators had gone to the station demanding jobs . In Baghdad , police Lieut . Mothana Ali said about 1,000 demonstrators arrived at the station demanding jobs .
0 1490044 1489975 Corixa shares rose 54 cents to $ 7.74 yesterday on the Nasdaq Stock Market . Shares of Corixa rose 54 cents , or about 8 percent , to close at $ 7.74 .
1 958161 957782 Committee approval , expected today , would set the stage for debate on the Senate floor beginning Monday . That would clear the way for debate in the full Senate beginning on Monday .
1 1033204 1033365 O 'Brien was charged with leaving the scene of a fatal accident , a felony . Bishop Thomas O 'Brien , 67 , was booked on a charge of leaving the scene of a fatal accident .
0 2996241 2996734 Tom Hamilton said his daughter was conscious and alert and in stable condition after the attack Friday morning . Bethany , who remained in stable condition after the attack Friday morning , talked of the attack Saturday .
0 2015389 2015410 The Calgary woman , who is in her twenties , donated blood on Aug. 7 . The woman -- who has no symptoms of illness -- donated blood Aug. 7 .
1 221515 221509 Quattrone lawyer John W. Keker said his client is innocent . In a statement Monday , his lawyer John Keker said ``Frank Quattrone is innocent .
0 2283737 2283794 In the weeks leading up to the execution , several Florida officials received anonymous threatening letters . Several Florida officials connected to the case have received threatening letters , accompanied by rifle bullets .
1 2826681 2826474 The disagreement over online music sales was disclosed in documents filed last week with the judge and made available by the court yesterday . The fight over online music sales was disclosed in documents made available Monday by the court .
1 2249237 2249305 Parson was charged with intentionally causing and attempting to cause damage to protected computers . Parson is charged with one count of intentionally causing damage to a protected computer .
1 389239 389299 " The court and the public need to know much more of the details of the defendant 's seemingly massive fraud , " the judge said . " The court and the public need to know more of the defendants ' seemingly massive fraud , " he said .
1 2652187 2652218 The U.S. Supreme Court will hear arguments on Wednesday on whether companies can be sued under the Americans with Disabilities Act for refusing to rehire rehabilitated drug users . The high court will hear arguments today on whether companies can be sued under the ADA for refusing to rehire rehabilitated drug users .
1 2945693 2945847 The IRS said taxpayers can avoid undelivered checks by having refunds deposited directly into their checking or savings accounts . The IRS said taxpayers can avoid problems with lost or stolen refunds by having refunds deposited directly into personal checking or savings accounts .
1 2065523 2065836 " More than 70,000 men and women from bases in Southern California were deployed in Iraq . In all , more than 70,000 troops based in Southern California were deployed to Iraq .
1 2222998 2223097 BP shares slipped 0.8 percent to 433.50 pence ( $ 6.85 ) each in afternoon trading on the London Stock Exchange . BP shares slipped 48 cents to $ 41.72 Friday in trading on the New York Stock Exchange .
1 2561999 2561941 Because of the accounting charge , the company now says it lost $ 1.04 billion , or 32 cents a share , in the quarter ended June 30 . Including the charge , the Santa Clara , Calif.-based company said Monday it lost $ 1.04 billion , or 32 cents per share , in the period ending June 30 .
0 2324704 2325023 Friday 's report raised new worries that a weak job market could shackle the budding economic recovery despite a slight improvement in the overall unemployment rate . U.S. companies slashed payrolls for a seventh straight month in August , raising new worries that a weak jobs market could shackle the budding economic recovery .
1 2336453 2336545 Federal Emergency Management Administration designated $ 20 million to establish the registry . The registry was launched with $ 20 million from the Federal Emergency Management Agency .
1 720572 720486 BREAST cancer cases in the UK have hit an all-time high with more than 40,000 women diagnosed with the disease each year , Cancer Re-search UK revealed yesterday . Cases of breast cancer in Britain have reached a record high , with the number of women diagnosed with the disease passing the 40,000 mark for the first time .
1 1605818 1605806 " It was never our intention to sell the product , " said Health Minister Anne McClellan , a skeptic of medical marijuana use . " It was never the intention of us to sell product , " federal Health Minister Anne McLellan said yesterday in Edmonton .
0 2440680 2440474 GM , the world 's largest automaker , has 115,000 active UAW workers and another 340,000 retirees and spouses . They cover more than 300,000 UAW workers and 500,000 retirees and spouses .
0 726399 726078 Rosenthal is hereby sentenced to custody of the Federal Bureau of prisons for one day with credit for time served , " Breyer said to tumultuous cheers in the courtroom . " Rosenthal is hereby sentenced to custody of the Federal Bureau of Prisons for one day with credit for time served . "
1 533903 533818 " We are committed to helping the Iraqi people get on the path to a free society , " Rumsfeld said in a speech to the Council on Foreign Relations . " We are committed to helping the Iraqi people get on the path to a free society , " he said .
1 1166473 1166857 Mr. Young said he was disappointed that the government didn 't see the severe acute respiratory syndrome crisis as worthy of federal disaster-relief money . Young said he was disappointed the government didn 't see the SARS crisis as worthy of federal disaster relief money .
1 144089 143697 The 12-nation currency has risen by 33 percent against the dollar over the past 15 months . The euro is up 9 percent against the dollar in the past six weeks .
1 3439854 3439874 In February 2000 , the officers — Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy — were acquitted of all charges in the killing . The officers -- Kenneth Boss , Sean Carroll , Edward McMellon and Richard Murphy -- were acquitted in 2000 of state murder charges .
1 3464314 3464302 I was surprised it turned out me talking and the president just listening . " I was surprised it turned out me talking and the president just listening . . . It was mostly a monologue . "
1 2008984 2009175 The state 's House delegation currently consists of 17 Democrats and 15 Republicans . Democrats hold a 17-15 edge in the state 's U.S. House delegation .
0 816867 816831 Freddie also said Leland C. Brendsel will retire as chairman and chief executive and resign from the board . He replaces Leland Brendsel , 61 , who retired as chairman and chief executive .
1 192285 192327 We 'll be listening carefully to the [ IAEA ] director general 's report at the next board meeting . " We 'll be listening carefully to the ( IAEA ) director-general 's report at the next board meeting . "
1 2688145 2688162 In that position , Elias will report to Joe Tucci , president and CEO of EMC . As executive vice president of new ventures , Elias will report to Joe Tucci , EMC 's president and chief executive .
1 3294207 3294290 But with the PM due to leave tomorrow afternoon for personal reasons there was a risk he might not be present when the final decision was made . But with the Prime Minister due to leave tomorrow , a day early , he may not be present when the final decision is made .
0 205100 205145 A pro-independence radical , Miodrag Zivkovic , of the Liberal Alliance , came in second with 31 percent of the vote . Miodrag Zivkovic , of the Liberal Alliance of Montenegro , won 31 percent of the vote while the independent Dragan Hajdukovic got four percent .
0 3242051 3241897 Mr. Kerkorian tried unsuccessfully to take over Chrysler in 1995 , but did win representation on its board . Kerkorian and Tracinda had also tried to take over Chrysler in 1995 .
0 1076861 1077018 Glover spoke at a news conference that included about 20 relatives of the victims . About 20 family members of the victims were invited to the news conference .
1 2095803 2095786 Drax faced a financial crisis late last year after it lost its most lucrative sales contract , held with insolvent utility TXU Europe . Drax ’ s troubles began late last year when it lost its most lucrative sales contract , with the insolvent utility TXU Europe .
1 2112330 2112376 But I would rather be talking about high standards than low standards . " " I would rather be talking about positive numbers rather than negative .
1 3389318 3389271 It was not immediately known how many people were on flight UTA 141 , which could carry 141 passengers and crew . It was still not known exactly how many people were on the plane , which could carry 141 passengers and crew .
1 698948 698933 The market remains pinned in a narrow range after a powerful rally drove the broad Standard & Poor 's 500 index .SPX up more than 20 percent since mid-March . The market remains pinned in a narrow range after a powerful rally pushed the broad S & P 500 index up more than 20 percent since mid-March .
1 539585 539355 Witnesses said they believed the man planned to crash the Launceston-bound Qantas flight 1737 , which was carrying 47 passengers and six crew . Witnesses believe he wanted to crash Flight 1737 , which had 47 passengers and six crew .
1 684848 684557 As Samudra sat down to hear the indictment , he looked over to his nine lawyers and shouted ``God is Great ' ' three times . As he sat down to hear the indictment , Samudra looked over to his nine lawyers and shouted " Takbir ! " , or " Proclaim ! " , a religious rallying cry .
1 347017 347002 In hardest-hit Taipei , traffic has disappeared from once bustling streets , ubiquitous department stores stand mostly empty and restaurants are eerily quiet . In hardest-hit Taipei , traffic has disappeared from once-bustling streets and department stores and restaurants are virtually empty .
1 1592037 1592076 In a statement , Lee said he " no longer believes that Viacom deliberately intended to trade on my name when naming Spike TV . " Spike Lee no longer believes that Viacom deliberately intended to trade on his name by calling its own venture " Spike TV , " according to a statement read in court Tuesday .
0 3013483 3013540 Singapore Prime Minister Goh Chok Tong says China plays an important role in the integration of Asia , including managing the stresses and strains both within and between countries . HAINAN PROVINCE , China : Singapore Prime Minister Goh Chok Tong said China plays an important role in the integration of Asia .
1 2020252 2020081 The worm attacks Windows computers via a hole in the operating system , an issue Microsoft on July 16 had warned about . The worm attacks Windows computers via a hole in the operating system , which Microsoft warned of 16 July .
0 2614947 2614904 The premium edition adds OfficeFront Page 2003 , Acceleration Server 2000 , and SQL Server 2000 . The premium edition adds ISA Server , SQL Server and a specialized edition of BizTalk 2004 .
0 1744257 1744378 In the year-ago quarter , the steelmaker recorded a profit of $ 16.2 million , or 15 cents per share , on sales of $ 1.14 billion . In the second quarter last year , AK Steel reported a profit of $ 16.2 million , or 15 cents a share .
0 1119721 1119714 Sony claimed that the reader 's capacitance sensing technology cannot be fooled by paper copies and does not require cleaning . Its capacitance sensing technology electronically reads a fingerprint ; Sony says it can 't be fooled by paper copies and doesn 't require cleaning .
1 1186754 1187056 Amazon.com shipped out more than a million copies of the new book , making Saturday the largest distribution day of a single item in e-commerce history . Amazon.com shipped more than a million copies by Saturday afternoon , making Saturday the largest distribution day of a single item in e-commerce history .
1 2842562 2842582 The show 's closure affected third-quarter earnings per share by a penny . The company said this impacted earnings by a penny a share .
0 431076 431242 After the two-hour meeting on May 14 , publisher Arthur O. Sulzberger Jr . , executive editor Howell Raines and managing editor Gerald Boyd pledged quick remedies to staff grievances . The committee will make recommendations to Publisher Arthur Sulzberger , Executive Editor Howell Raines and Managing Editor Gerald Boyd .
1 1393764 1393984 It 's been a busy couple of days for security gurus assigned to keep their companies safe and sound . It 's been a busy couple of days for enterprise security gurus tasked with the job of keeping their companies safe and sound .
0 2916199 2916164 Lu reclined in a soft chair wearing a woolly coat near the blackened capsule . " It 's great to be back home , " said Lu , dressed in a woolly coat near the blackened capsule .
1 2530671 2530542 Gov. Bob Riley proposed the budget cuts after Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 . After Alabama voters rejected his $ 1.2 billion tax plan Sept . 9 , Riley forecast significant cuts in state programs .
1 219064 218969 " It is probably not the easiest time to come in and take over the shuttle program , but then again , I look forward to the challenge , " he said . " It 's probably not the easiest time to come in and take over the shuttle program , but I look forward to the challenge , " Parsons told reporters at NASA headquarters .
0 2377289 2377259 Estonia 's place in the European mainstream and safeguard its independence regained in 1991 . Estonia was forcibly incorporated in the Soviet Union in 1940 and regained its independence only in 1991 .
0 2110220 2110199 Franklin County Judge-Executive Teresa Barton said a firefighter was struck by lightning and was taken to the Frankfort Regional Medical Center . A county firefighter , was struck by lightning and was in stable condition at Frankfort Regional Medical Center .
0 1864253 1863810 Police suspected that Shaichat , 20 , had been abducted either by Palestinians or by Israeli Arabs . Nobody claimed responsibility for Schaichat 's death , but police suspect that the 20-year-old soldier was abducted either by Palestinians or Israeli Arabs .
0 3150803 3150839 During this year 's August to October quarter , Lowe 's opened 38 new stores , including two relocations . During the third quarter , Lowe 's opened 38 new stores and now has 932 stores in 45 states .
0 969381 969512 The technology-laced Nasdaq Composite Index < .IXIC > declined 25.78 points , or 1.56 percent , to 1,627.84 . The broader Standard & Poor 's 500 Index .SPX gave up 11.91 points , or 1.19 percent , at 986.60 .
1 271891 271839 Sony said the PSP would also feature a 4.5-inch LCD screen , Memory Stick expansion slots . It also features a 4.5 in back-lit LCD screen and memory expansion facilities .
0 2829648 2829613 Clinton did not mention that two Democratic senators , Charles Robb of Virginia and Wendell Ford of Kentucky , voted to shelve the McCain bill . Two Democrats , Sen. Charles Robb of Virginia and Wendell Ford of Kentucky , voted with the 40 Republicans .
1 886904 887158 Some of the company 's software developers will join Microsoft , but details haven 't been finalized , said Mike Nash , corporate vice president of Microsoft 's security business unit . Some of the companys software developers will join Microsoft , but details havent been finalized , said Mike Nash , corporate vice president of Microsofts security business unit .
0 2632692 2632767 Wal-Mart has said it plans to open at least 40 Supercenters in the state in the coming years ; analysts expect four or more to be in San Diego County . At least 40 of the outlets will be in California , and analysts expect four or more to be in San Diego County .
1 2240399 2240149 Cintas is battling efforts to unionize 17,000 of its workers and to let unions organize the workers by signing cards , rather than by a lengthy election process . Cintas is battling efforts to unionize 17,000 of its workers and labor 's demands to let its workers organize by signing cards , rather than by a lengthy election process .
1 805457 805985 The opposition would resort to rolling mass action " at strategic times of our choice and without warning to the dictatorship , " he said . " From now onwards we will embark on rolling mass action at strategic times of our choice and without any warning to the dictatorship , " he said .
1 2896308 2896334 Federal Agriculture Minister Warren Truss said the Government still did not know the real reason the sheep were rejected at the Saudi port of Jeddah on August 21 . He said the Government still did not know the real reason the original Saudi buyer pulled out on August 21 .
1 2110775 2110924 Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said that scenario is one among many that investigators are considering . Tom Kraynak , manager of operations and resources for the Canton , Ohio-based East Central Area Reliability Council , said investigators are considering the scenario .
1 1762569 1762526 Hester said Sanmina was the best fit among several purchase offers the company received from electronics manufacturers and computer makers . Hester said Sanmina 's offer was the best among several Newisys received from electronics manufacturers and computer makers .
0 2706154 2706185 The other inmate fell but Selenski shimmed down the makeshift rope to a second-story roof and used the mattress to scale a razor-wire fence , Fischi said . After the other inmate fell , Selenski used the mattress to scale a 10-foot , razor-wire fence , Fischi said .
1 1057995 1057778 The hearing , expected to last a week , will determine whether Akbar faces a court-martial . The purpose of the hearing is to determine whether Akbar should be court-martialled .
1 1386884 1386857 He said he has begun a court action to seize Beacon Hill 's assets and has frozen more than $ 13 million Beacon Hill had when it closed . He said he has initiated a forfeiture action in court and frozen more than $ 13 million Beacon Hill had when it closed .
1 3093023 3092996 Speaking for the first time yesterday , Brigitte 's maternal aunt said his family was unaware he had was in prison or that he had remarried . Brigitte 's maternal aunt said his family was unaware he had been sent to prison , or that he had remarried in Sydney .
1 1661381 1661317 " Close co-operation between our law enforcement agencies , close co-operation between our intelligence services lie at the heart of the ongoing fight against terrorism . " Close cooperation between regional law enforcement agencies and intelligence services was at the heart of the fight against terrorism , he said .
0 2926039 2925982 The mother of a Briton held by Colombian guerrillasspoke of her relief yesterday after hearing that he might be freed in the next few weeks . The parents of a Briton being held hostage by Colombian rebels spoke yesterday of their optimism that he would be freed in time for his birthday next month .
0 637168 637447 We strongly disagree with Novell 's position and view it as a desperate measure to curry favor with the Linux community . McBride characterized Novell 's move as " a desperate measure to curry favor with the Linux community . "
1 696677 696932 After more than two years ' detention under the State Security Bureau , the four were found guilty of subversion in Beijing 's No. 1 Intermediate Court last Wednesday . After more than two years in detention by the State Security Bureau , the four were found guilty last Wednesday of subversion .
1 3122429 3122305 Mr Russell , 46 , a coal miner from Brisbane , said : " They are obviously hurting , so we are basically going over there to help them . " " They are obviously hurting so we are basically going over there to help them , " Russell , 46 , said .
1 1348909 1348954 The New York Democrat and former first lady has said she will not run for the White House in 2004 , but has not ruled out a race in later years . The former first lady has said she will not run for the White House in 2004 but has not ruled out a race later on .
0 162203 162101 It does not affect the current Windows Media Player 9.0 Series . Windows Media Player has had security problems before .
0 71501 71627 The seizure took place at 4 a.m. on March 18 , just hours before the first American air assault . The time was about 4 a.m. on March 18 , just hours before the first pinpoint missiles rained down on the capital .
1 2907762 2907649 Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations and large branches of the United Way by 15 percent and 28.6 percent , respectively . Donations stemming from the Sept . 11 attacks helped push up contributions to human service organizations by 15 percent and to large branches of the United Way by 28.6 percent .
1 2167771 2167744 In May , Mr. Hatfill said he was struck by a vehicle being driven by an FBI employee who was tailing him in Georgetown . Last May , Hatfill was struck by a vehicle being driven by an FBI employee who was tailing him in Washington 's Georgetown neighborhood .
1 3320577 3320553 " I will support a constitutional amendment which would honor marriage between a man and a woman , codify that , " he said . " If necessary , I will support a constitutional amendment which would honour marriage between a man and a woman , codify that . "
1 849291 849442 IBM of the US and Infineon Technologies of Germany will today announce a technological development that could threaten multi-billion dollar memory chip markets . IBMof the US andInfineon Technologies of Germany willon Tuesdayannounce a technological development that could threaten multi-billion dollar memory chip markets .
0 763948 763991 Costa 's semifinal opponent is Spaniard Juan Carlos Ferrero , whom he beat in last year 's final . Costa will play Juan Carlos Ferrero next in a rematch of last year 's final .
1 1908763 1908744 A former employee of a local power company pleaded guilty Wednesday to setting off a bomb that knocked out a power substation during the Winter Olympics last year . A former Utah Power meter reader pleaded guilty Wednesday to bombing a power substation during the 2002 Winter Olympics .
0 1876120 1876059 Thyroid hormones are known to help in weight loss by stimulating metabolism - and cutting cholesterol - but come with the unwanted side effect of speeding up the heartbeat . Thyroid hormones are known to help in weight loss by stimulating metabolism , and they can help cut cholesterol too .
1 518089 518133 Judge Craig Doran said it wasn 't his role to determine if Hovan was " an evil man " but maintained that " he has committed an evil act . " Judge Craig Doran said he couldn 't determine if Hovan was " an evil man " but said he " has committed an evil act . "
0 224932 224868 The Hartford shares rose $ 2.88 , or 6.6 percent , to close Monday at $ 46.50 on the New York Stock Exchange . Shares of Hartford rose $ 2.88 to $ 46.50 in New York Stock Exchange composite trading .
1 1771131 1771091 It also offers a built-in NAND flash boot loader so that high-density NAND flash memory can be used without having to install an additional support chip . The S3C2440 has a built-in NAND flash boot loader , for example , so that high-density NAND flash memory can be installed without an additional support chip .
0 2728425 2728251 It decided instead to issue them before the stock market opened Monday after the downgrade of its debt late Friday by Moody 's , the credit rating agency . It decided instead to issue them before the stock market opened Monday to counteract the downgrade of its debt late Friday by Moody 's to one step above junk status .
0 953733 953537 Altria shares fell 2.5 percent or $ 1.11 to $ 42.57 and were the Dow 's biggest percentage loser . Its shares fell $ 9.61 to $ 50.26 , ranking as the NYSE 's most-active issue and its biggest percentage loser .
1 349215 349241 It will be followed in November by a third movie , " The Matrix Revolutions . " The film is the second of a trilogy , which will wrap up in November with " The Matrix Revolutions . "
1 2919853 2919804 Massachusetts regulators and the Securities and Exchange Commission on Tuesday pressed securities fraud charges against Putnam Investments and two of its former portfolio managers for alleged improper mutual fund trading . State and federal securities regulators filed civil charges against Putnam Investments and two portfolio managers in the ever-expanding mutual fund trading scandal .
1 954526 954607 He is blocking them until the Air Force assigns four additional C-130 cargo planes to Gowen Field , an Idaho Air National Guard base in Boise . He is holding them up until the Air Force agrees to assign four additional C-130 cargo planes to the Idaho Air National Guard .
1 69773 69792 Cisco pared spending to compensate for sluggish sales . In response to sluggish sales , Cisco pared spending .
0 2823575 2823513 The study , published Monday in the journal Molecular Brain Research , is likely to also apply to humans , its authors said . The study , conducted on the brains of developing mice , was being published today in the journal Molecular Brain Research .
1 2455942 2455978 My decision today is not based on any one event . " Governor Rowland said his decision was " not based on any one event . "
1 131979 131957 Nelson , 27 , is being retried on civil-rights charges stemming from the disturbance which led to Rosenbaum 's death . Nelson , 27 , is being retried on civil rights charges stemming from the disturbance that led to Rosenbaum 's death .
0 2010705 2010779 " The government elements who have been causing trouble are still in place . The government elements who have been causing trouble are still in place , they are attacking us . "
1 54142 53641 Next Monday at about 2 p.m. ( CST ) , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms . Around the same time , hospital officials in and near Chicago will notice a sudden increase in people complaining of flu-like symptoms .
1 1015249 1015204 Wal-Mart Stores Inc . , Kohl 's Corp. , Family Dollar Stores Inc. and Big Lots Inc. were among the merchants posting May sales that fell below Wall Street 's modest expectations . Wal- Mart , Kohl 's Corp. , Family Dollar Stores Inc . , and Big Lots Inc. posted May sales that fell below Wall Street 's modest expectations .
0 753928 753890 The patch also fixes a vulnerability that results because IE does not implement an appropriate block on a file download dialog box . The second vulnerability is a result of IE not implementing a block on a file download dialog box .
1 3022833 3023029 Peterson , a former fertilizer salesman , is charged with murder in the deaths of his 27-year-old wife and the baby boy she was carrying . Peterson , 31 , is now charged with murder in the deaths of his 27-year-old wife and their unborn son .
0 751520 751373 SPOT products run a Microsoft operating system and the company 's DirectBand radio technology developed with SCA Data Systems . The DirectBand network was developed with the assistance of SCA Data Systems .
0 218848 218851 He replaces Ron Dittemore , who announced his resignation in April . Dittemore announced his plans to resign on April 23 .
1 3181118 3181443 Detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , of the arrest shortly after Perry was apprehended . Shortly after his arrest , detectives told Deasean 's father , Stelly Chisolm , a college student , and mother , Kimberly Hill , a medical assistant , about the development .
1 515581 515752 They were among about 40 people attending the traditional Jewish ceremony colored by some non-traditional touches . He said about 40 people attended the traditional Jewish ceremony colored by some nontraditional touches .
1 347022 347003 Taiwan had been relatively free of the viral infection until a fiasco at a Taipei hospital in late April caused the number of infections to skyrocket . Taiwan had been relatively free of the viral infection until a severe outbreak at a Taipei hospital in late April .
1 3311600 3311633 Mr. Rowland attended a party in South Windsor for the families of Connecticut National Guard soldiers called to active duty . Rowland was making an appearance at a holiday party for families of Connecticut National Guard soldiers assigned to duty in Iraq and Afghanistan .
0 3439114 3439084 Ross Garber , Rowland 's lawyer , said Tuesday he would attend the meeting and would ask to speak on the issue . Ross Garber , Rowland 's legal counsel , said the governor would have no comment on the condo deal .
0 487951 488007 The euro was at 1.5281 versus the Swiss franc EURCHF = , up 0.2 percent on the session , after hitting its highest since mid-2001 around 1.5292 earlier in the session . The euro was steady versus the Swiss franc after hitting its highest since mid-2001 of 1.5261 earlier in the session .
0 314997 315030 On the stand Wednesday , she said she was referring only to the kissing . On the stand Wednesday , she testified that she was referring to the kissing before the alleged rape .
0 4733 4557 Garner said the group would probably be expanded to include , for example , a Christian and perhaps another Sunni leader . The group has already met several times and Gen. Garner said it probably will be expanded to include a Christian and perhaps another Sunni Muslim leader .
1 2820371 2820525 Blair 's Foreign Secretary Jack Straw was to take his place on Monday to give a statement to parliament on the European Union .
Blair 's office said his Foreign Secretary Jack Straw would take his place on Monday to give a statement to parliament on the EU meeting the prime minister attended last week . 1 801552 801516 " There were more people surrounding the clubhouse than the Unabomber 's house up in the hills , " Baker said . " There are more people surrounding the clubhouse than surrounded the Unabomber 's home in the hills . 1 1704987 1705268 Charles O. Prince , 53 , was named as Mr. Weill 's successor . Mr. Weill 's longtime confidant , Charles O. Prince , 53 , was named as his successor . 1 396041 396188 Officials are also meeting with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . Canadian officials were also expected to meet yesterday with the International Organization for Epizootics ( OIE ) , which establishes animal-health standards for the world . 0 1014983 1014963 GE stock closed Friday at $ 30.65 a share , down about 42 cents , on the New York Stock Exchange . GE 's shares closed at $ 30.65 on Friday on the New York Stock Exchange . 1 2320654 2320666 The Midwestern research center will focus on the development of diagnostic , therapeutic and vaccine products for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . The Midwestern center will focus on diagnosis , treatment and vaccines for anthrax , botulism , tularemia , hemorrhagic fever viruses and plague . 1 1057876 1057778 The hearing is to determine whether there is enough evidence to order Akbar to a general court-martial proceeding . The purpose of the hearing is to determine whether Akbar should be court-martialled . 0 2116843 2116883 In the United States , heart attacks kill about 460,000 year , in Canada about 80,000 . In the United States , heart attacks kill about 460,000 yearly , according to the National Institutes of Health . 1 1461629 1461781 Ninety-five percent of international cargo to the United States is carried by ship . Ships carry 95 percent of international cargo to the United States . 0 374015 374162 " It 's a major victory for Maine , and it 's a major victory for other states . The Maine program could be a model for other states . 1 2493369 2493428 News that oil producers were lowering their output starting in November exacerbated a sell-off that was already under way on Wall Street . News that the Organization of Petroleum Exporting Countries was lowering output starting in November exacerbated a stock sell-off already under way yesterday . 1 490355 490378 They note that after several weeks of rallies on upbeat earnings , investors are looking for stronger evidence of a recovery before sending stocks higher . After several weeks of market rallies on upbeat earnings , many investors are looking for more concrete signs of an economic recovery . 1 2691044 2691264 Most economists had expected a more dire report , with many anticipating the fifth month of job losses in six months . Most economists had been expecting a far more dire report , with many expecting to see the fifth month of job losses in six months in September . 1 1831453 1831491 But software license revenues , a measure financial analysts watch closely , decreased 21 percent to $ 107.6 million . License sales , a key measure of demand , fell 21 percent to $ 107.6 million . 1 2380695 2380822 King , brand-name writer , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters . 
Stephen King , master of the horror story and e-book pioneer , is receiving this year 's medal for Distinguished Contributions to American Letters from the National Book Foundation . 1 2577517 2577531 The Denver-based natural gas producer and marketer said the inaccurate reporting was discovered after it received a subpoena from the U.S. Commodity Futures Trading Commission . The natural gas producer and marketer said the inaccurate reporting was discovered in response to a subpoena from the U.S. Commodity Futures Trading Commission , or CFTC . 1 3267026 3266930 The steel tariffs , which the U.S. president imposed in March 2002 , will officially end at midnight , instead of March 2005 as initially planned . The U.S. steel tariffs , which Bush imposed in March 2002 , were to officially end at midnight Thursday ( 0500 GMT ) , instead of March 2005 as initially planned . 1 360875 360943 Business Week 's online edition reported on Friday that WorldCom and the SEC could announce a settlement as early as Monday . BusinessWeek Online has learned that the settlement could come as early as Monday , May 19 . 1 162632 162653 Only one of the five buildings in the Baghdad compound of the United Nations Development Program escaped being burned , the UN said on its Web site . Only one of the five buildings in the compound in Baghdad run by the UN Development Program , escaped being burned , the UN said on its Web site . 1 1128884 1128865 Shares of Salix have rocketed 64 percent since Axcan made its first offer on April 10 . Since the initial takeover offer , Salix shares have risen about 35 percent . 1 3264732 3264648 The jury verdict , reached Wednesday after less than four hours of deliberation , followed a 2 week trial , during which Waagner represented himself . The quick conviction followed a 2 1 / 2 week trial , during which the Venango County man represented himself . 1 1721433 1721267 It 's happened five times in the last 11 years : A disaster puts this Southwestern town in the headlines during the summer tourist season . It 's happened five times in the last decade : A disaster puts this tourist town in the headlines during summer , its busiest season . 0 146112 146127 The broader Standard & Poor 's 500 Index .SPX edged down 9 points , or 0.98 percent , to 921 . The technology-laced Nasdaq Composite Index < .IXIC > shed 15 points , or 0.98 percent , to 1,492 . 1 389117 389052 The company emphasized that McDonald 's USA does not import any raw beef or hamburger patties from Canada for McDonald 's use in the United States . McDonald 's said in a statement that it does not import any raw beef or hamburger patties from Canada for use in the United States . 1 872784 872834 Gregory Parseghian , a former investment banker , was appointed chief executive . Greg Parseghian was appointed the new chief executive . 0 2977500 2977547 Their contract will expire at 12 : 01 a.m. Wednesday instead of 12 : 01 a.m. Sunday , said Rian Wathen , organizing director for United Food and Commercial Workers Local 700 . " It has outraged the membership , " said Rian Wathen , organizing director of United Food and Commercial Workers Local 700 . 1 3107137 3107119 But plaque volume increased by 2.7 percent in pravastatin patients . The volume of plaque in Pravachol patients ' arteries rose by 3 % . 1 1619244 1619274 Today in the US , the book - kept under wraps by its publishers , G. P. Putnam 's Sons , since its inception - will appear in bookstores . Tomorrow the book , kept under wraps by G. P. 
Putnam 's Sons since its inception , will appear in bookstores . 0 3061836 3062031 The S & P / TSX composite rose 87.74 points on the week , while the TSX Venture Exchange composite gained 44.49 points . On the week , the Dow Jones industrial average rose 11.56 points , while the Nasdaq Stock Market gained 39.42 points . 1 485999 486011 Ex-KGB agent Putin added that the Beatles were considered ' propaganda of an alien ideology ' . In Soviet times the Beatles ' music " was considered propaganda of an alien ideology .

================================================
FILE: src/examples/tensorflow/bert_demo/latency_printer.py
================================================
latency_list = []
with open('latencies.txt', 'r') as f:
    for line in f:
        latency_list.append(float(line.rstrip()))
latency_list = sorted(latency_list)
l = len(latency_list)
print(f'p50 latency is {latency_list[int(.5 * l)]} seconds')
print(f'p90 latency is {latency_list[int(.9 * l)]} seconds')
print(f'p95 latency is {latency_list[int(.95 * l)]} seconds')
print(f'p99 latency is {latency_list[int(.99 * l)]} seconds')
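The script above indexes the sorted list directly, a nearest-rank estimate that can be coarse for small samples. As a point of comparison, here is a minimal sketch of the same report using the standard library's statistics.quantiles (Python 3.8+, needs at least two samples); the name report_latencies is illustrative and not part of the demo:

import statistics

def report_latencies(path='latencies.txt'):
    # One latency value (in seconds) per line, as latency_printer.py expects.
    with open(path) as f:
        latencies = [float(line.rstrip()) for line in f]
    # quantiles(n=100) returns the 1st..99th percentile cut points and
    # interpolates between samples instead of picking a single element.
    cuts = statistics.quantiles(latencies, n=100)
    for p in (50, 90, 95, 99):
        print(f'p{p} latency is {cuts[p - 1]} seconds')

if __name__ == '__main__':
    report_latencies()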
================================================
FILE: src/examples/tensorflow/bert_demo/mrpc.proto
================================================
// Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT-0
//
// Protocol definition for the BERT MRPC paraphrase-detection service.

syntax = "proto3";

package mrpc;

service mrpc {
    rpc paraphrase (TextPair) returns (YesNo) {}
}

message TextPair {
    bytes text_a = 1;
    bytes text_b = 2;
}

message YesNo {
    bytes message = 1;
    bytes prediction = 2;
}
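Once protoc.sh (further below) has generated mrpc_pb2.py from this definition, the messages round-trip like any protobuf type. A quick sketch; the sentence strings are made up for illustration:

import mrpc_pb2

# Build a request carrying the two candidate sentences as UTF-8 bytes.
pair = mrpc_pb2.TextPair(
    text_a='The company said sales rose.'.encode('utf-8'),
    text_b='Sales increased, the company said.'.encode('utf-8'),
)

# Wire-format round trip: serialize, then parse back.
wire = pair.SerializeToString()
decoded = mrpc_pb2.TextPair.FromString(wire)
assert decoded.text_a == pair.text_a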
================================================
FILE: src/examples/tensorflow/bert_demo/mrpc_feature.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract pre-computed feature vectors from BERT."""

import os
import csv
import time
import numpy as np
import tokenization


class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
                sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
                Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
                specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class PaddingInputExample(object):
    """Fake example so the num input examples is a multiple of the batch size.

    When running eval/predict on the TPU, we need to pad the number of examples
    to be a multiple of the batch size, because the TPU requires a fixed batch
    size. The alternative is to drop the last batch, which is bad because it
    means the entire output data won't be generated.

    We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
    """


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id,
                 is_real_example=True):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        self.is_real_example = is_real_example


def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
    """Converts a single `InputExample` into a single `InputFeatures`."""
    if isinstance(example, PaddingInputExample):
        return InputFeatures(
            input_ids=[0] * max_seq_length,
            input_mask=[0] * max_seq_length,
            segment_ids=[0] * max_seq_length,
            label_id=0,
            is_real_example=False)

    label_map = {}
    for (i, label) in enumerate(label_list):
        label_map[label] = i

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

    if tokens_b:
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0        0 0    1  1  1  1  1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
    #
    # Where "type_ids" are used to indicate whether this is the first
    # sequence or the second sequence. The embedding vectors for `type=0` and
    # `type=1` were learned during pre-training and are added to the wordpiece
    # embedding vector (and position vector). This is not *strictly* necessary
    # since the [SEP] token unambiguously separates the sequences, but it makes
    # it easier for the model to learn the concept of sequences.
    #
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    label_id = label_map[example.label]
    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_id=label_id,
        is_real_example=True)
    return feature


def read_tsv(input_file, quotechar=None):
    """Reads a tab separated value file."""
    with open(input_file, "r") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = []
        for line in reader:
            lines.append(line)
        return lines


def create_examples(lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
        if i == 0:
            continue
        guid = "%s-%s" % (set_type, i)
        text_a = tokenization.convert_to_unicode(line[3])
        text_b = tokenization.convert_to_unicode(line[4])
        if set_type == "test":
            label = "0"
        else:
            label = tokenization.convert_to_unicode(line[0])
        examples.append(
            InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""
    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def get_eval_model_feed_dict_list(mrpc_tsv, vocab_txt):
    """Converts the MRPC dev TSV into a list of batched feed dictionaries."""
    tsv = read_tsv(mrpc_tsv)
    result = create_examples(tsv, "dev")
    model_feed_dict_list = []
    for example in result:
        tokenizer = tokenization.FullTokenizer(vocab_file=vocab_txt, do_lower_case=True)
        label_list = ['0', '1']
        feature = convert_single_example(ex_index=0, example=example, label_list=label_list,
                                         max_seq_length=128, tokenizer=tokenizer)
        pre_model_feed_dict = {
            'input_ids': feature.input_ids,
            'input_mask': feature.input_mask,
            'segment_ids': feature.segment_ids,
            'label_id': feature.label_id,
            'is_real_example': feature.is_real_example,
        }
        model_feed_dict = {}
        for key, value in pre_model_feed_dict.items():
            if key in {'label_id', 'is_real_example'}:
                value = np.tile(np.int32(value), reps=[1])
            else:
                value = np.tile(np.int32(value), reps=[1, 1])
            model_feed_dict[key] = value
        model_feed_dict_list.append(model_feed_dict)
    return model_feed_dict_list


def text_pair_to_model_feed_dict(text_a, text_b, tokenizer):
    """Converts a raw sentence pair into a batch-of-one feed dictionary."""
    fake_tsv = [['index', '#1 ID', '#2 ID', '#1 String', '#2 String'],
                ['', '', '', text_a, text_b]]
    result = create_examples(fake_tsv, "test")
    example = result[0]
    label_list = ['0', '1']
    feature = convert_single_example(ex_index=0, example=example, label_list=label_list,
                                     max_seq_length=128, tokenizer=tokenizer)
    return {
        'input_ids': np.tile(np.int32(feature.input_ids), reps=[1, 1]),
        'input_mask': np.tile(np.int32(feature.input_mask), reps=[1, 1]),
        'segment_ids': np.tile(np.int32(feature.segment_ids), reps=[1, 1]),
    }
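The longer-sequence-first truncation in _truncate_seq_pair is easy to check in isolation. A small sketch, assuming mrpc_feature.py is importable from the working directory; the token lists are made up:

from mrpc_feature import _truncate_seq_pair

tokens_a = ['the', 'quick', 'brown', 'fox', 'jumps']
tokens_b = ['lazy', 'dog']

# With a budget of 5 tokens total, the longer list gives up tokens first.
_truncate_seq_pair(tokens_a, tokens_b, max_length=5)
print(tokens_a, tokens_b)  # ['the', 'quick', 'brown'] ['lazy', 'dog']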
================================================
FILE: src/examples/tensorflow/bert_demo/mrpc_pb2.py
================================================
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: mrpc.proto

import sys
_b=sys.version_info[0]<3 and (lambda x:x) or (lambda x:x.encode('latin1'))
from google.protobuf import descriptor as _descriptor
from google.protobuf import message as _message
from google.protobuf import reflection as _reflection
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)

_sym_db = _symbol_database.Default()


DESCRIPTOR = _descriptor.FileDescriptor(
  name='mrpc.proto',
  package='mrpc',
  syntax='proto3',
  serialized_options=None,
  serialized_pb=_b('\n\nmrpc.proto\x12\x04mrpc\"*\n\x08TextPair\x12\x0e\n\x06text_a\x18\x01 \x01(\x0c\x12\x0e\n\x06text_b\x18\x02 \x01(\x0c\",\n\x05YesNo\x12\x0f\n\x07message\x18\x01 \x01(\x0c\x12\x12\n\nprediction\x18\x02 \x01(\x0c\x32\x33\n\x04mrpc\x12+\n\nparaphrase\x12\x0e.mrpc.TextPair\x1a\x0b.mrpc.YesNo\"\x00\x62\x06proto3')
)


_TEXTPAIR = _descriptor.Descriptor(
  name='TextPair',
  full_name='mrpc.TextPair',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  fields=[
    _descriptor.FieldDescriptor(
      name='text_a', full_name='mrpc.TextPair.text_a', index=0,
      number=1, type=12, cpp_type=9, label=1,
      has_default_value=False, default_value=_b(""),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR),
    _descriptor.FieldDescriptor(
      name='text_b', full_name='mrpc.TextPair.text_b', index=1,
      number=2, type=12, cpp_type=9, label=1,
      has_default_value=False, default_value=_b(""),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR),
  ],
  extensions=[
  ],
  nested_types=[],
  enum_types=[
  ],
  serialized_options=None,
  is_extendable=False,
  syntax='proto3',
  extension_ranges=[],
  oneofs=[
  ],
  serialized_start=20,
  serialized_end=62,
)


_YESNO = _descriptor.Descriptor(
  name='YesNo',
  full_name='mrpc.YesNo',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  fields=[
    _descriptor.FieldDescriptor(
      name='message', full_name='mrpc.YesNo.message', index=0,
      number=1, type=12, cpp_type=9, label=1,
      has_default_value=False, default_value=_b(""),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR),
    _descriptor.FieldDescriptor(
      name='prediction', full_name='mrpc.YesNo.prediction', index=1,
      number=2, type=12, cpp_type=9, label=1,
      has_default_value=False, default_value=_b(""),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR),
  ],
  extensions=[
  ],
  nested_types=[],
  enum_types=[
  ],
  serialized_options=None,
  is_extendable=False,
  syntax='proto3',
  extension_ranges=[],
  oneofs=[
  ],
  serialized_start=64,
  serialized_end=108,
)

DESCRIPTOR.message_types_by_name['TextPair'] = _TEXTPAIR
DESCRIPTOR.message_types_by_name['YesNo'] = _YESNO
_sym_db.RegisterFileDescriptor(DESCRIPTOR)

TextPair = _reflection.GeneratedProtocolMessageType('TextPair', (_message.Message,), {
  'DESCRIPTOR' : _TEXTPAIR,
  '__module__' : 'mrpc_pb2'
  # @@protoc_insertion_point(class_scope:mrpc.TextPair)
  })
_sym_db.RegisterMessage(TextPair)

YesNo = _reflection.GeneratedProtocolMessageType('YesNo', (_message.Message,), {
  'DESCRIPTOR' : _YESNO,
  '__module__' : 'mrpc_pb2'
  # @@protoc_insertion_point(class_scope:mrpc.YesNo)
  })
_sym_db.RegisterMessage(YesNo)


_MRPC = _descriptor.ServiceDescriptor(
  name='mrpc',
  full_name='mrpc.mrpc',
  file=DESCRIPTOR,
  index=0,
  serialized_options=None,
  serialized_start=110,
  serialized_end=161,
  methods=[
  _descriptor.MethodDescriptor(
    name='paraphrase',
    full_name='mrpc.mrpc.paraphrase',
    index=0,
    containing_service=None,
    input_type=_TEXTPAIR,
    output_type=_YESNO,
    serialized_options=None,
  ),
])
_sym_db.RegisterServiceDescriptor(_MRPC)

DESCRIPTOR.services_by_name['mrpc'] = _MRPC

# @@protoc_insertion_point(module_scope)

================================================
FILE: src/examples/tensorflow/bert_demo/mrpc_pb2_grpc.py
================================================
# coding=utf-8
"""
Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

gRPC client stub and server handlers for the mrpc paraphrase service.
"""
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
import grpc

import mrpc_pb2 as mrpc__pb2


class mrpcStub(object):
    # missing associated documentation comment in .proto file
    pass

    def __init__(self, channel):
        """Constructor.

        Args:
            channel: A grpc.Channel.
        """
        self.paraphrase = channel.unary_unary(
            '/mrpc.mrpc/paraphrase',
            request_serializer=mrpc__pb2.TextPair.SerializeToString,
            response_deserializer=mrpc__pb2.YesNo.FromString,
        )


class mrpcServicer(object):
    # missing associated documentation comment in .proto file
    pass

    def paraphrase(self, request, context):
        # missing associated documentation comment in .proto file
        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
        context.set_details('Method not implemented!')
        raise NotImplementedError('Method not implemented!')


def add_mrpcServicer_to_server(servicer, server):
    rpc_method_handlers = {
        'paraphrase': grpc.unary_unary_rpc_method_handler(
            servicer.paraphrase,
            request_deserializer=mrpc__pb2.TextPair.FromString,
            response_serializer=mrpc__pb2.YesNo.SerializeToString,
        ),
    }
    generic_handler = grpc.method_handlers_generic_handler(
        'mrpc.mrpc', rpc_method_handlers)
    server.add_generic_rpc_handlers((generic_handler,))
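A minimal client sketch against the stub above, assuming the generated modules are importable; the localhost:50051 endpoint is an illustrative assumption, not a documented default of the demo's server:

import grpc
import mrpc_pb2
import mrpc_pb2_grpc

# Assumed endpoint for illustration; use whatever address the server binds.
channel = grpc.insecure_channel('localhost:50051')
stub = mrpc_pb2_grpc.mrpcStub(channel)

reply = stub.paraphrase(mrpc_pb2.TextPair(
    text_a=b'The storm closed the airport.',
    text_b=b'The airport was shut down by the storm.',
))
print(reply.message, reply.prediction)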
================================================
FILE: src/examples/tensorflow/bert_demo/protoc.sh
================================================
#!/bin/bash
# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# Regenerates mrpc_pb2.py and mrpc_pb2_grpc.py from mrpc.proto.

python -m grpc_tools.protoc -I . --python_out=. --grpc_python_out=. mrpc.proto

================================================
FILE: src/examples/tensorflow/bert_demo/setup.py
================================================
import setuptools

setuptools.setup(
    name='bert-demo',
    version='2019.12.13',
    description='BERT Client-Server Demo',
    author='Amazon AWS',
    author_email='aws-neuron-support@amazon.com',
    license='BSD',
    classifiers=[
        'Development Status :: 1 - Planning',
        'Intended Audience :: Developers',
        'Topic :: Scientific/Engineering :: Artificial Intelligence',
        'License :: OSI Approved :: BSD License',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
    ],
    keywords='bert',
    include_package_data=True,
    packages=setuptools.PEP420PackageFinder.find(),
    package_data={'': [
        '*',
    ]},
    entry_points={
        'console_scripts': [
            'neuron_bert_model=bert_demo.bert_model:main',
            'bert_server=bert_demo.bert_server:serve',
            'bert_client=bert_demo.bert_client:client',
        ],
    },
    install_requires=[
    ],
)
================================================
FILE: src/examples/tensorflow/bert_demo/tokenization.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import re
import unicodedata
import six


def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
    """Checks whether the casing config is consistent with the checkpoint name."""

    # The casing has to be passed in by the user and there is no explicit check
    # as to whether it matches the checkpoint. The casing information probably
    # should have been stored in the bert_config.json file, but it's not, so
    # we have to heuristically detect it to validate.

    if not init_checkpoint:
        return

    m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
    if m is None:
        return

    model_name = m.group(1)

    lower_models = [
        "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
        "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
    ]

    cased_models = [
        "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
        "multi_cased_L-12_H-768_A-12"
    ]

    is_bad_config = False
    if model_name in lower_models and not do_lower_case:
        is_bad_config = True
        actual_flag = "False"
        case_name = "lowercased"
        opposite_flag = "True"

    if model_name in cased_models and do_lower_case:
        is_bad_config = True
        actual_flag = "True"
        case_name = "cased"
        opposite_flag = "False"

    if is_bad_config:
        raise ValueError(
            "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
            "However, `%s` seems to be a %s model, so you "
            "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
            "how the model was pre-trained. If this error is wrong, please "
            "just comment out this check." % (actual_flag, init_checkpoint,
                                              model_name, case_name, opposite_flag))


def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python 2 or Python 3?")


def printable_text(text):
    """Returns text encoded in a way suitable for print or `tf.logging`."""

    # These functions want `str` for both Python2 and Python3, but in one case
    # it's a Unicode string and in the other it's a byte string.
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text
        elif isinstance(text, unicode):
            return text.encode("utf-8")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python 2 or Python 3?")


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    index = 0
    with open(vocab_file, "r") as reader:
        while True:
            token = convert_to_unicode(reader.readline())
            if not token:
                break
            token = token.strip()
            vocab[token] = index
            index += 1
    return vocab


def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output


def convert_tokens_to_ids(vocab, tokens):
    return convert_by_vocab(vocab, tokens)


def convert_ids_to_tokens(inv_vocab, ids):
    return convert_by_vocab(inv_vocab, ids)


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


class FullTokenizer(object):
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)
        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)


class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.

        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1
        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or
                (cp >= 0x3400 and cp <= 0x4DBF) or
                (cp >= 0x20000 and cp <= 0x2A6DF) or
                (cp >= 0x2A700 and cp <= 0x2B73F) or
                (cp >= 0x2B740 and cp <= 0x2B81F) or
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or
                (cp >= 0x2F800 and cp <= 0x2FA1F)):
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)


class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform
        tokenization using the given vocabulary.

        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]

        Args:
            text: A single token or whitespace separated tokens. This should
                have already been passed through `BasicTokenizer`.

        Returns:
            A list of wordpiece tokens.
        """
        text = convert_to_unicode(text)

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens


def _is_whitespace(char):
    """Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `chars` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat in ("Cc", "Cf"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
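A self-contained sketch of the tokenizer pipeline above, using a made-up six-entry vocabulary written to a temporary file (the real demo uses the full uncased_L-24_H-1024_A-16 vocabulary below):

import tempfile

from tokenization import FullTokenizer

# A miniature vocabulary, one token per line as load_vocab expects.
vocab_tokens = ['[UNK]', '[CLS]', '[SEP]', 'un', '##aff', '##able']
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(vocab_tokens) + '\n')
    vocab_path = f.name

tokenizer = FullTokenizer(vocab_file=vocab_path, do_lower_case=True)
tokens = tokenizer.tokenize('unaffable')
print(tokens)                                   # ['un', '##aff', '##able']
print(tokenizer.convert_tokens_to_ids(tokens))  # [3, 4, 5]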
""" text = convert_to_unicode(text) output_tokens = [] for token in whitespace_tokenize(text): chars = list(token) if len(chars) > self.max_input_chars_per_word: output_tokens.append(self.unk_token) continue is_bad = False start = 0 sub_tokens = [] while start < len(chars): end = len(chars) cur_substr = None while start < end: substr = "".join(chars[start:end]) if start > 0: substr = "##" + substr if substr in self.vocab: cur_substr = substr break end -= 1 if cur_substr is None: is_bad = True break sub_tokens.append(cur_substr) start = end if is_bad: output_tokens.append(self.unk_token) else: output_tokens.extend(sub_tokens) return output_tokens def _is_whitespace(char): """Checks whether `chars` is a whitespace character.""" # \t, \n, and \r are technically contorl characters but we treat them # as whitespace since they are generally considered as such. if char == " " or char == "\t" or char == "\n" or char == "\r": return True cat = unicodedata.category(char) if cat == "Zs": return True return False def _is_control(char): """Checks whether `chars` is a control character.""" # These are technically control characters but we count them as whitespace # characters. if char == "\t" or char == "\n" or char == "\r": return False cat = unicodedata.category(char) if cat in ("Cc", "Cf"): return True return False def _is_punctuation(char): """Checks whether `chars` is a punctuation character.""" cp = ord(char) # We treat all non-letter/number ASCII as punctuation. # Characters such as "^", "$", and "`" are not in the Unicode # Punctuation class but we treat them as punctuation anyways, for # consistency. if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): return True cat = unicodedata.category(char) if cat.startswith("P"): return True return False ================================================ FILE: src/examples/tensorflow/bert_demo/tune_save.sh ================================================ #!/bin/bash pushd $BERT_REPO_DIR python run_classifier.py \ --task_name=MRPC \ --do_train=true \ --do_eval=true \ --do_predict=true \ --data_dir=$GLUE_DIR/MRPC \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ --max_seq_length=128 \ --train_batch_size=32 \ --learning_rate=2e-5 \ --num_train_epochs=3.0 \ --output_dir=$BERT_REPO_DIR/MRPC_finetune python run_classifier.py \ --task_name=MRPC \ --do_predict=true \ --data_dir=$GLUE_DIR/MRPC \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ --max_seq_length=128 \ --output_dir=$BERT_REPO_DIR/MRPC_finetune popd ================================================ FILE: src/examples/tensorflow/bert_demo/uncased_L-24_H-1024_A-16.vocab.txt ================================================ [PAD] [unused0] [unused1] [unused2] [unused3] [unused4] [unused5] [unused6] [unused7] [unused8] [unused9] [unused10] [unused11] [unused12] [unused13] [unused14] [unused15] [unused16] [unused17] [unused18] [unused19] [unused20] [unused21] [unused22] [unused23] [unused24] [unused25] [unused26] [unused27] [unused28] [unused29] [unused30] [unused31] [unused32] [unused33] [unused34] [unused35] [unused36] [unused37] [unused38] [unused39] [unused40] [unused41] [unused42] [unused43] [unused44] [unused45] [unused46] [unused47] [unused48] [unused49] [unused50] [unused51] [unused52] [unused53] [unused54] [unused55] [unused56] [unused57] [unused58] 
[unused59] [unused60] [unused61] [unused62] [unused63] [unused64] [unused65] [unused66] [unused67] [unused68] [unused69] [unused70] [unused71] [unused72] [unused73] [unused74] [unused75] [unused76] [unused77] [unused78] [unused79] [unused80] [unused81] [unused82] [unused83] [unused84] [unused85] [unused86] [unused87] [unused88] [unused89] [unused90] [unused91] [unused92] [unused93] [unused94] [unused95] [unused96] [unused97] [unused98] [UNK] [CLS] [SEP] [MASK] [unused99] [unused100] [unused101] [unused102] [unused103] [unused104] [unused105] [unused106] [unused107] [unused108] [unused109] [unused110] [unused111] [unused112] [unused113] [unused114] [unused115] [unused116] [unused117] [unused118] [unused119] [unused120] [unused121] [unused122] [unused123] [unused124] [unused125] [unused126] [unused127] [unused128] [unused129] [unused130] [unused131] [unused132] [unused133] [unused134] [unused135] [unused136] [unused137] [unused138] [unused139] [unused140] [unused141] [unused142] [unused143] [unused144] [unused145] [unused146] [unused147] [unused148] [unused149] [unused150] [unused151] [unused152] [unused153] [unused154] [unused155] [unused156] [unused157] [unused158] [unused159] [unused160] [unused161] [unused162] [unused163] [unused164] [unused165] [unused166] [unused167] [unused168] [unused169] [unused170] [unused171] [unused172] [unused173] [unused174] [unused175] [unused176] [unused177] [unused178] [unused179] [unused180] [unused181] [unused182] [unused183] [unused184] [unused185] [unused186] [unused187] [unused188] [unused189] [unused190] [unused191] [unused192] [unused193] [unused194] [unused195] [unused196] [unused197] [unused198] [unused199] [unused200] [unused201] [unused202] [unused203] [unused204] [unused205] [unused206] [unused207] [unused208] [unused209] [unused210] [unused211] [unused212] [unused213] [unused214] [unused215] [unused216] [unused217] [unused218] [unused219] [unused220] [unused221] [unused222] [unused223] [unused224] [unused225] [unused226] [unused227] [unused228] [unused229] [unused230] [unused231] [unused232] [unused233] [unused234] [unused235] [unused236] [unused237] [unused238] [unused239] [unused240] [unused241] [unused242] [unused243] [unused244] [unused245] [unused246] [unused247] [unused248] [unused249] [unused250] [unused251] [unused252] [unused253] [unused254] [unused255] [unused256] [unused257] [unused258] [unused259] [unused260] [unused261] [unused262] [unused263] [unused264] [unused265] [unused266] [unused267] [unused268] [unused269] [unused270] [unused271] [unused272] [unused273] [unused274] [unused275] [unused276] [unused277] [unused278] [unused279] [unused280] [unused281] [unused282] [unused283] [unused284] [unused285] [unused286] [unused287] [unused288] [unused289] [unused290] [unused291] [unused292] [unused293] [unused294] [unused295] [unused296] [unused297] [unused298] [unused299] [unused300] [unused301] [unused302] [unused303] [unused304] [unused305] [unused306] [unused307] [unused308] [unused309] [unused310] [unused311] [unused312] [unused313] [unused314] [unused315] [unused316] [unused317] [unused318] [unused319] [unused320] [unused321] [unused322] [unused323] [unused324] [unused325] [unused326] [unused327] [unused328] [unused329] [unused330] [unused331] [unused332] [unused333] [unused334] [unused335] [unused336] [unused337] [unused338] [unused339] [unused340] [unused341] [unused342] [unused343] [unused344] [unused345] [unused346] [unused347] [unused348] [unused349] [unused350] [unused351] [unused352] [unused353] [unused354] [unused355] 
[unused356] [unused357] [unused358] [unused359] [unused360] [unused361] [unused362] [unused363] [unused364] [unused365] [unused366] [unused367] [unused368] [unused369] [unused370] [unused371] [unused372] [unused373] [unused374] [unused375] [unused376] [unused377] [unused378] [unused379] [unused380] [unused381] [unused382] [unused383] [unused384] [unused385] [unused386] [unused387] [unused388] [unused389] [unused390] [unused391] [unused392] [unused393] [unused394] [unused395] [unused396] [unused397] [unused398] [unused399] [unused400] [unused401] [unused402] [unused403] [unused404] [unused405] [unused406] [unused407] [unused408] [unused409] [unused410] [unused411] [unused412] [unused413] [unused414] [unused415] [unused416] [unused417] [unused418] [unused419] [unused420] [unused421] [unused422] [unused423] [unused424] [unused425] [unused426] [unused427] [unused428] [unused429] [unused430] [unused431] [unused432] [unused433] [unused434] [unused435] [unused436] [unused437] [unused438] [unused439] [unused440] [unused441] [unused442] [unused443] [unused444] [unused445] [unused446] [unused447] [unused448] [unused449] [unused450] [unused451] [unused452] [unused453] [unused454] [unused455] [unused456] [unused457] [unused458] [unused459] [unused460] [unused461] [unused462] [unused463] [unused464] [unused465] [unused466] [unused467] [unused468] [unused469] [unused470] [unused471] [unused472] [unused473] [unused474] [unused475] [unused476] [unused477] [unused478] [unused479] [unused480] [unused481] [unused482] [unused483] [unused484] [unused485] [unused486] [unused487] [unused488] [unused489] [unused490] [unused491] [unused492] [unused493] [unused494] [unused495] [unused496] [unused497] [unused498] [unused499] [unused500] [unused501] [unused502] [unused503] [unused504] [unused505] [unused506] [unused507] [unused508] [unused509] [unused510] [unused511] [unused512] [unused513] [unused514] [unused515] [unused516] [unused517] [unused518] [unused519] [unused520] [unused521] [unused522] [unused523] [unused524] [unused525] [unused526] [unused527] [unused528] [unused529] [unused530] [unused531] [unused532] [unused533] [unused534] [unused535] [unused536] [unused537] [unused538] [unused539] [unused540] [unused541] [unused542] [unused543] [unused544] [unused545] [unused546] [unused547] [unused548] [unused549] [unused550] [unused551] [unused552] [unused553] [unused554] [unused555] [unused556] [unused557] [unused558] [unused559] [unused560] [unused561] [unused562] [unused563] [unused564] [unused565] [unused566] [unused567] [unused568] [unused569] [unused570] [unused571] [unused572] [unused573] [unused574] [unused575] [unused576] [unused577] [unused578] [unused579] [unused580] [unused581] [unused582] [unused583] [unused584] [unused585] [unused586] [unused587] [unused588] [unused589] [unused590] [unused591] [unused592] [unused593] [unused594] [unused595] [unused596] [unused597] [unused598] [unused599] [unused600] [unused601] [unused602] [unused603] [unused604] [unused605] [unused606] [unused607] [unused608] [unused609] [unused610] [unused611] [unused612] [unused613] [unused614] [unused615] [unused616] [unused617] [unused618] [unused619] [unused620] [unused621] [unused622] [unused623] [unused624] [unused625] [unused626] [unused627] [unused628] [unused629] [unused630] [unused631] [unused632] [unused633] [unused634] [unused635] [unused636] [unused637] [unused638] [unused639] [unused640] [unused641] [unused642] [unused643] [unused644] [unused645] [unused646] [unused647] [unused648] [unused649] [unused650] [unused651] 
[BERT-style WordPiece vocabulary listing elided: reserved ``[unusedNNN]`` placeholder tokens, followed by single ASCII and Unicode characters, then frequency-ordered whole-word tokens and ``##``-prefixed subword pieces, one token per vocabulary id.]
shirley fancy dominic ##bie madonna ##rick bark buttons gymnasium ashes liver toby oath providence doyle evangelical nixon cement carnegie embarked hatch surroundings guarantee needing pirate essence ##bee filter crane hammond projected immune percy twelfth ##ult regent doctoral damon mikhail ##ichi lu critically elect realised abortion acute screening mythology steadily ##fc frown nottingham kirk wa minneapolis ##rra module algeria mc nautical encounters surprising statues availability shirts pie alma brows munster mack soup crater tornado sanskrit cedar explosive bordered dixon planets stamp exam happily ##bble carriers kidnapped ##vis accommodation emigrated ##met knockout correspondent violation profits peaks lang specimen agenda ancestry pottery spelling equations obtaining ki linking 1825 debris asylum ##20 buddhism teddy ##ants gazette ##nger ##sse dental eligibility utc fathers averaged zimbabwe francesco coloured hissed translator lynch mandate humanities mackenzie uniforms lin ##iana ##gio asset mhz fitting samantha genera wei rim beloved shark riot entities expressions indo carmen slipping owing abbot neighbor sidney ##av rats recommendations encouraging squadrons anticipated commanders conquered ##oto donations diagnosed ##mond divide ##iva guessed decoration vernon auditorium revelation conversations ##kers ##power herzegovina dash alike protested lateral herman accredited mg ##gent freeman mel fiji crow crimson ##rine livestock ##pped humanitarian bored oz whip ##lene ##ali legitimate alter grinning spelled anxious oriental wesley ##nin ##hole carnival controller detect ##ssa bowed educator kosovo macedonia ##sin occupy mastering stephanie janeiro para unaware nurses noon 135 cam hopefully ranger combine sociology polar rica ##eer neill ##sman holocaust ##ip doubled lust 1828 109 decent cooling unveiled ##card 1829 nsw homer chapman meyer ##gin dive mae reagan expertise ##gled darwin brooke sided prosecution investigating comprised petroleum genres reluctant differently trilogy johns vegetables corpse highlighted lounge pension unsuccessfully elegant aided ivory beatles amelia cain dubai sunny immigrant babe click ##nder underwater pepper combining mumbled atlas horns accessed ballad physicians homeless gestured rpm freak louisville corporations patriots prizes rational warn modes decorative overnight din troubled phantom ##ort monarch sheer ##dorf generals guidelines organs addresses ##zon enhance curling parishes cord ##kie linux caesar deutsche bavaria ##bia coleman cyclone ##eria bacon petty ##yama ##old hampton diagnosis 1824 throws complexity rita disputed ##₃ pablo ##sch marketed trafficking ##ulus examine plague formats ##oh vault faithful ##bourne webster ##ox highlights ##ient ##ann phones vacuum sandwich modeling ##gated bolivia clergy qualities isabel ##nas ##ars wears screams reunited annoyed bra ##ancy ##rate differential transmitter tattoo container poker ##och excessive resides cowboys ##tum augustus trash providers statute retreated balcony reversed void storey preceded masses leap laughs neighborhoods wards schemes falcon santo battlefield pad ronnie thread lesbian venus ##dian beg sandstone daylight punched gwen analog stroked wwe acceptable measurements dec toxic ##kel adequate surgical economist parameters varsity ##sberg quantity ella ##chy ##rton countess generating precision diamonds expressway ga ##ı 1821 uruguay talents galleries expenses scanned colleague outlets ryder lucien ##ila paramount ##bon syracuse dim fangs gown sweep ##sie toyota 
missionaries websites ##nsis sentences adviser val trademark spells ##plane patience starter slim ##borg toe incredibly shoots elliot nobility ##wyn cowboy endorsed gardner tendency persuaded organisms emissions kazakhstan amused boring chips themed ##hand llc constantinople chasing systematic guatemala borrowed erin carey ##hard highlands struggles 1810 ##ifying ##ced wong exceptions develops enlarged kindergarten castro ##ern ##rina leigh zombie juvenile ##most consul ##nar sailor hyde clarence intensive pinned nasty useless jung clayton stuffed exceptional ix apostolic 230 transactions ##dge exempt swinging cove religions ##ash shields dairy bypass 190 pursuing bug joyce bombay chassis southampton chat interact redesignated ##pen nascar pray salmon rigid regained malaysian grim publicity constituted capturing toilet delegate purely tray drift loosely striker weakened trinidad mitch itv defines transmitted ming scarlet nodding fitzgerald fu narrowly sp tooth standings virtue ##₁ ##wara ##cting chateau gloves lid ##nel hurting conservatory ##pel sinclair reopened sympathy nigerian strode advocated optional chronic discharge ##rc suck compatible laurel stella shi fails wage dodge 128 informal sorts levi buddha villagers ##aka chronicles heavier summoned gateway 3000 eleventh jewelry translations accordingly seas ##ency fiber pyramid cubic dragging ##ista caring ##ops android contacted lunar ##dt kai lisbon patted 1826 sacramento theft madagascar subtropical disputes ta holidays piper willow mare cane itunes newfoundland benny companions dong raj observe roar charming plaque tibetan fossils enacted manning bubble tina tanzania ##eda ##hir funk swamp deputies cloak ufc scenario par scratch metals anthem guru engaging specially ##boat dialects nineteen cecil duet disability messenger unofficial ##lies defunct eds moonlight drainage surname puzzle honda switching conservatives mammals knox broadcaster sidewalk cope ##ried benson princes peterson ##sal bedford sharks eli wreck alberto gasp archaeology lgbt teaches securities madness compromise waving coordination davidson visions leased possibilities eighty jun fernandez enthusiasm assassin sponsorship reviewer kingdoms estonian laboratories ##fy ##nal applies verb celebrations ##zzo rowing lightweight sadness submit mvp balanced dude ##vas explicitly metric magnificent mound brett mohammad mistakes irregular ##hing ##ass sanders betrayed shipped surge ##enburg reporters termed georg pity verbal bulls abbreviated enabling appealed ##are ##atic sicily sting heel sweetheart bart spacecraft brutal monarchy ##tter aberdeen cameo diane ##ub survivor clyde ##aries complaint ##makers clarinet delicious chilean karnataka coordinates 1818 panties ##rst pretending ar dramatically kiev bella tends distances 113 catalog launching instances telecommunications portable lindsay vatican ##eim angles aliens marker stint screens bolton ##rne judy wool benedict plasma europa spark imaging filmmaker swiftly ##een contributor ##nor opted stamps apologize financing butter gideon sophisticated alignment avery chemicals yearly speculation prominence professionally ##ils immortal institutional inception wrists identifying tribunal derives gains ##wo papal preference linguistic vince operative brewery ##ont unemployment boyd ##ured ##outs albeit prophet 1813 bi ##rr ##face ##rad quarterly asteroid cleaned radius temper ##llen telugu jerk viscount menu ##ote glimpse ##aya yacht hawaiian baden ##rl laptop readily ##gu monetary offshore scots watches ##yang ##arian upgrade 
needle xbox lea encyclopedia flank fingertips ##pus delight teachings confirm roth beaches midway winters ##iah teasing daytime beverly gambling bonnie ##backs regulated clement hermann tricks knot ##shing ##uring ##vre detached ecological owed specialty byron inventor bats stays screened unesco midland trim affection ##ander ##rry jess thoroughly feedback ##uma chennai strained heartbeat wrapping overtime pleaded ##sworth mon leisure oclc ##tate ##ele feathers angelo thirds nuts surveys clever gill commentator ##dos darren rides gibraltar ##nc ##mu dissolution dedication shin meals saddle elvis reds chaired taller appreciation functioning niece favored advocacy robbie criminals suffolk yugoslav passport constable congressman hastings vera ##rov consecrated sparks ecclesiastical confined ##ovich muller floyd nora 1822 paved 1827 cumberland ned saga spiral ##flow appreciated yi collaborative treating similarities feminine finishes ##ib jade import ##nse ##hot champagne mice securing celebrities helsinki attributes ##gos cousins phases ache lucia gandhi submission vicar spear shine tasmania biting detention constitute tighter seasonal ##gus terrestrial matthews ##oka effectiveness parody philharmonic ##onic 1816 strangers encoded consortium guaranteed regards shifts tortured collision supervisor inform broader insight theaters armour emeritus blink incorporates mapping ##50 ##ein handball flexible ##nta substantially generous thief ##own carr loses 1793 prose ucla romeo generic metallic realization damages mk commissioners zach default ##ther helicopters lengthy stems spa partnered spectators rogue indication penalties teresa 1801 sen ##tric dalton ##wich irving photographic ##vey dell deaf peters excluded unsure ##vable patterson crawled ##zio resided whipped latvia slower ecole pipes employers maharashtra comparable va textile pageant ##gel alphabet binary irrigation chartered choked antoine offs waking supplement ##wen quantities demolition regain locate urdu folks alt 114 ##mc scary andreas whites ##ava classrooms mw aesthetic publishes valleys guides cubs johannes bryant conventions affecting ##itt drain awesome isolation prosecutor ambitious apology captive downs atmospheric lorenzo aisle beef foul ##onia kidding composite disturbed illusion natives ##ffer emi rockets riverside wartime painters adolf melted ##ail uncertainty simulation hawks progressed meantime builder spray breach unhappy regina russians ##urg determining ##tation tram 1806 ##quin aging ##12 1823 garion rented mister diaz terminated clip 1817 depend nervously disco owe defenders shiva notorious disbelief shiny worcester ##gation ##yr trailing undertook islander belarus limitations watershed fuller overlooking utilized raphael 1819 synthetic breakdown klein ##nate moaned memoir lamb practicing ##erly cellular arrows exotic ##graphy witches 117 charted rey hut hierarchy subdivision freshwater giuseppe aloud reyes qatar marty sideways utterly sexually jude prayers mccarthy softball blend damien ##gging ##metric wholly erupted lebanese negro revenues tasted comparative teamed transaction labeled maori sovereignty parkway trauma gran malay 121 advancement descendant 2020 buzz salvation inventory symbolic ##making antarctica mps ##gas ##bro mohammed myanmar holt submarines tones ##lman locker patriarch bangkok emerson remarks predators kin afghan confession norwich rental emerge advantages ##zel rca ##hold shortened storms aidan ##matic autonomy compliance ##quet dudley atp ##osis 1803 motto documentation summary professors 
spectacular christina archdiocese flashing innocence remake ##dell psychic reef scare employ rs sticks meg gus leans ##ude accompany bergen tomas ##iko doom wages pools ##nch ##bes breasts scholarly alison outline brittany breakthrough willis realistic ##cut ##boro competitor ##stan pike picnic icon designing commercials washing villain skiing micro costumes auburn halted executives ##hat logistics cycles vowel applicable barrett exclaimed eurovision eternity ramon ##umi ##lls modifications sweeping disgust ##uck torch aviv ensuring rude dusty sonic donovan outskirts cu pathway ##band ##gun ##lines disciplines acids cadet paired ##40 sketches ##sive marriages ##⁺ folding peers slovak implies admired ##beck 1880s leopold instinct attained weston megan horace ##ination dorsal ingredients evolutionary ##its complications deity lethal brushing levy deserted institutes posthumously delivering telescope coronation motivated rapids luc flicked pays volcano tanner weighed ##nica crowds frankie gifted addressing granddaughter winding ##rna constantine gomez ##front landscapes rudolf anthropology slate werewolf ##lio astronomy circa rouge dreaming sack knelt drowned naomi prolific tracked freezing herb ##dium agony randall twisting wendy deposit touches vein wheeler ##bbled ##bor batted retaining tire presently compare specification daemon nigel ##grave merry recommendation czechoslovakia sandra ng roma ##sts lambert inheritance sheikh winchester cries examining ##yle comeback cuisine nave ##iv ko retrieve tomatoes barker polished defining irene lantern personalities begging tract swore 1809 175 ##gic omaha brotherhood ##rley haiti ##ots exeter ##ete ##zia steele dumb pearson 210 surveyed elisabeth trends ##ef fritz ##rf premium bugs fraction calmly viking ##birds tug inserted unusually ##ield confronted distress crashing brent turks resign ##olo cambodia gabe sauce ##kal evelyn 116 extant clusters quarry teenagers luna ##lers ##ister affiliation drill ##ashi panthers scenic libya anita strengthen inscriptions ##cated lace sued judith riots ##uted mint ##eta preparations midst dub challenger ##vich mock cf displaced wicket breaths enables schmidt analyst ##lum ag highlight automotive axe josef newark sufficiently resembles 50th ##pal flushed mum traits ##ante commodore incomplete warming titular ceremonial ethical 118 celebrating eighteenth cao lima medalist mobility strips snakes ##city miniature zagreb barton escapes umbrella automated doubted differs cooled georgetown dresden cooked fade wyatt rna jacobs carlton abundant stereo boost madras inning ##hia spur ip malayalam begged osaka groan escaping charging dose vista ##aj bud papa communists advocates edged tri ##cent resemble peaking necklace fried montenegro saxony goose glances stuttgart curator recruit grocery sympathetic ##tting ##fort 127 lotus randolph ancestor ##rand succeeding jupiter 1798 macedonian ##heads hiking 1808 handing fischer ##itive garbage node ##pies prone singular papua inclined attractions italia pouring motioned grandma garnered jacksonville corp ego ringing aluminum ##hausen ordering ##foot drawer traders synagogue ##play ##kawa resistant wandering fragile fiona teased var hardcore soaked jubilee decisive exposition mercer poster valencia hale kuwait 1811 ##ises ##wr ##eed tavern gamma 122 johan ##uer airways amino gil ##ury vocational domains torres ##sp generator folklore outcomes ##keeper canberra shooter fl beams confrontation ##lling ##gram feb aligned forestry pipeline jax motorway conception decay ##tos coffin 
##cott stalin 1805 escorted minded ##nam sitcom purchasing twilight veronica additions passive tensions straw 123 frequencies 1804 refugee cultivation ##iate christie clary bulletin crept disposal ##rich ##zong processor crescent ##rol bmw emphasized whale nazis aurora ##eng dwelling hauled sponsors toledo mega ideology theatres tessa cerambycidae saves turtle cone suspects kara rusty yelling greeks mozart shades cocked participant ##tro shire spit freeze necessity ##cos inmates nielsen councillors loaned uncommon omar peasants botanical offspring daniels formations jokes 1794 pioneers sigma licensing ##sus wheelchair polite 1807 liquor pratt trustee ##uta forewings balloon ##zz kilometre camping explicit casually shawn foolish teammates nm hassan carrie judged satisfy vanessa knives selective cnn flowed ##lice eclipse stressed eliza mathematician cease cultivated ##roy commissions browns ##ania destroyers sheridan meadow ##rius minerals ##cial downstream clash gram memoirs ventures baha seymour archie midlands edith fare flynn invite canceled tiles stabbed boulder incorporate amended camden facial mollusk unreleased descriptions yoga grabs 550 raises ramp shiver ##rose coined pioneering tunes qing warwick tops 119 melanie giles ##rous wandered ##inal annexed nov 30th unnamed ##ished organizational airplane normandy stoke whistle blessing violations chased holders shotgun ##ctic outlet reactor ##vik tires tearing shores fortified mascot constituencies nc columnist productive tibet ##rta lineage hooked oct tapes judging cody ##gger hansen kashmir triggered ##eva solved cliffs ##tree resisted anatomy protesters transparent implied ##iga injection mattress excluding ##mbo defenses helpless devotion ##elli growl liberals weber phenomena atoms plug ##iff mortality apprentice howe convincing aaa swimmer barber leone promptly sodium def nowadays arise ##oning gloucester corrected dignity norm erie ##ders elders evacuated sylvia compression ##yar hartford pose backpack reasoning accepts 24th wipe millimetres marcel ##oda dodgers albion 1790 overwhelmed aerospace oaks 1795 showcase acknowledge recovering nolan ashe hurts geology fashioned disappearance farewell swollen shrug marquis wimbledon 124 rue 1792 commemorate reduces experiencing inevitable calcutta intel ##court murderer sticking fisheries imagery bloom 280 brake ##inus gustav hesitation memorable po viral beans accidents tunisia antenna spilled consort treatments aye perimeter ##gard donation hostage migrated banker addiction apex lil trout ##ously conscience ##nova rams sands genome passionate troubles ##lets ##set amid ##ibility ##ret higgins exceed vikings ##vie payne ##zan muscular ##ste defendant sucking ##wal ibrahim fuselage claudia vfl europeans snails interval ##garh preparatory statewide tasked lacrosse viktor ##lation angola ##hra flint implications employs teens patrons stall weekends barriers scrambled nucleus tehran jenna parsons lifelong robots displacement 5000 ##bles precipitation ##gt knuckles clutched 1802 marrying ecology marx accusations declare scars kolkata mat meadows bermuda skeleton finalists vintage crawl coordinate affects subjected orchestral mistaken ##tc mirrors dipped relied 260 arches candle ##nick incorporating wildly fond basilica owl fringe rituals whispering stirred feud tertiary slick goat honorable whereby skip ricardo stripes parachute adjoining submerged synthesizer ##gren intend positively ninety phi beaver partition fellows alexis prohibition carlisle bizarre fraternity ##bre doubts icy cbc 
aquatic sneak sonny combines airports crude supervised spatial merge alfonso ##bic corrupt scan undergo ##ams disabilities colombian comparing dolphins perkins ##lish reprinted unanimous bounced hairs underworld midwest semester bucket paperback miniseries coventry demise ##leigh demonstrations sensor rotating yan ##hler arrange soils ##idge hyderabad labs ##dr brakes grandchildren ##nde negotiated rover ferrari continuation directorate augusta stevenson counterpart gore ##rda nursery rican ave collectively broadly pastoral repertoire asserted discovering nordic styled fiba cunningham harley middlesex survives tumor tempo zack aiming lok urgent ##rade ##nto devils ##ement contractor turin ##wl ##ool bliss repaired simmons moan astronomical cr negotiate lyric 1890s lara bred clad angus pbs ##ience engineered posed ##lk hernandez possessions elbows psychiatric strokes confluence electorate lifts campuses lava alps ##ep ##ution ##date physicist woody ##page ##ographic ##itis juliet reformation sparhawk 320 complement suppressed jewel ##½ floated ##kas continuity sadly ##ische inability melting scanning paula flour judaism safer vague ##lm solving curb ##stown financially gable bees expired miserable cassidy dominion 1789 cupped 145 robbery facto amos warden resume tallest marvin ing pounded usd declaring gasoline ##aux darkened 270 650 sophomore ##mere erection gossip televised risen dial ##eu pillars ##link passages profound ##tina arabian ashton silicon nail ##ead ##lated ##wer ##hardt fleming firearms ducked circuits blows waterloo titans ##lina atom fireplace cheshire financed activation algorithms ##zzi constituent catcher cherokee partnerships sexuality platoon tragic vivian guarded whiskey meditation poetic ##late ##nga ##ake porto listeners dominance kendra mona chandler factions 22nd salisbury attitudes derivative ##ido ##haus intake paced javier illustrator barrels bias cockpit burnett dreamed ensuing ##anda receptors someday hawkins mattered ##lal slavic 1799 jesuit cameroon wasted tai wax lowering victorious freaking outright hancock librarian sensing bald calcium myers tablet announcing barack shipyard pharmaceutical ##uan greenwich flush medley patches wolfgang pt speeches acquiring exams nikolai ##gg hayden kannada ##type reilly ##pt waitress abdomen devastated capped pseudonym pharmacy fulfill paraguay 1796 clicked ##trom archipelago syndicated ##hman lumber orgasm rejection clifford lorraine advent mafia rodney brock ##ght ##used ##elia cassette chamberlain despair mongolia sensors developmental upstream ##eg ##alis spanning 165 trombone basque seeded interred renewable rhys leapt revision molecule ##ages chord vicious nord shivered 23rd arlington debts corpus sunrise bays blackburn centimetres ##uded shuddered gm strangely gripping cartoons isabelle orbital ##ppa seals proving ##lton refusal strengthened bust assisting baghdad batsman portrayal mara pushes spears og ##cock reside nathaniel brennan 1776 confirmation caucus ##worthy markings yemen nobles ku lazy viewer catalan encompasses sawyer ##fall sparked substances patents braves arranger evacuation sergio persuade dover tolerance penguin cum jockey insufficient townships occupying declining plural processed projection puppet flanders introduces liability ##yon gymnastics antwerp taipei hobart candles jeep wes observers 126 chaplain bundle glorious ##hine hazel flung sol excavations dumped stares sh bangalore triangular icelandic intervals expressing turbine ##vers songwriting crafts ##igo jasmine ditch rite ##ways 
entertaining comply sorrow wrestlers basel emirates marian rivera helpful ##some caution downward networking ##atory ##tered darted genocide emergence replies specializing spokesman convenient unlocked fading augustine concentrations resemblance elijah investigator andhra ##uda promotes bean ##rrell fleeing wan simone announcer ##ame ##bby lydia weaver 132 residency modification ##fest stretches ##ast alternatively nat lowe lacks ##ented pam tile concealed inferior abdullah residences tissues vengeance ##ided moisture peculiar groove zip bologna jennings ninja oversaw zombies pumping batch livingston emerald installations 1797 peel nitrogen rama ##fying ##star schooling strands responding werner ##ost lime casa accurately targeting ##rod underway ##uru hemisphere lester ##yard occupies 2d griffith angrily reorganized ##owing courtney deposited ##dd ##30 estadio ##ifies dunn exiled ##ying checks ##combe ##о ##fly successes unexpectedly blu assessed ##flower ##ه observing sacked spiders kn ##tail mu nodes prosperity audrey divisional 155 broncos tangled adjust feeds erosion paolo surf directory snatched humid admiralty screwed gt reddish ##nese modules trench lamps bind leah bucks competes ##nz ##form transcription ##uc isles violently clutching pga cyclist inflation flats ragged unnecessary ##hian stubborn coordinated harriet baba disqualified 330 insect wolfe ##fies reinforcements rocked duel winked embraced bricks ##raj hiatus defeats pending brightly jealousy ##xton ##hm ##uki lena gdp colorful ##dley stein kidney ##shu underwear wanderers ##haw ##icus guardians m³ roared habits ##wise permits gp uranium punished disguise bundesliga elise dundee erotic partisan pi collectors float individually rendering behavioral bucharest ser hare valerie corporal nutrition proportional ##isa immense ##kis pavement ##zie ##eld sutherland crouched 1775 ##lp suzuki trades endurance operas crosby prayed priory rory socially ##urn gujarat ##pu walton cube pasha privilege lennon floods thorne waterfall nipple scouting approve ##lov minorities voter dwight extensions assure ballroom slap dripping privileges rejoined confessed demonstrating patriotic yell investor ##uth pagan slumped squares ##cle ##kins confront bert embarrassment ##aid aston urging sweater starr yuri brains williamson commuter mortar structured selfish exports ##jon cds ##him unfinished ##rre mortgage destinations ##nagar canoe solitary buchanan delays magistrate fk ##pling motivation ##lier ##vier recruiting assess ##mouth malik antique 1791 pius rahman reich tub zhou smashed airs galway xii conditioning honduras discharged dexter ##pf lionel 129 debates lemon tiffany volunteered dom dioxide procession devi sic tremendous advertisements colts transferring verdict hanover decommissioned utter relate pac racism ##top beacon limp similarity terra occurrence ant ##how becky capt updates armament richie pal ##graph halloween mayo ##ssen ##bone cara serena fcc dolls obligations ##dling violated lafayette jakarta exploitation ##ime infamous iconic ##lah ##park kitty moody reginald dread spill crystals olivier modeled bluff equilibrium separating notices ordnance extinction onset cosmic attachment sammy expose privy anchored ##bil abbott admits bending baritone emmanuel policeman vaughan winged climax dresses denny polytechnic mohamed burmese authentic nikki genetics grandparents homestead gaza postponed metacritic una ##sby ##bat unstable dissertation ##rial ##cian curls obscure uncovered bronx praying disappearing ##hoe prehistoric coke turret 
mutations nonprofit pits monaco ##ي ##usion prominently dispatched podium ##mir uci ##uation 133 fortifications birthplace kendall ##lby ##oll preacher rack goodman ##rman persistent ##ott countless jaime recorder lexington persecution jumps renewal wagons ##11 crushing ##holder decorations ##lake abundance wrath laundry £1 garde ##rp jeanne beetles peasant ##sl splitting caste sergei ##rer ##ema scripts ##ively rub satellites ##vor inscribed verlag scrapped gale packages chick potato slogan kathleen arabs ##culture counterparts reminiscent choral ##tead rand retains bushes dane accomplish courtesy closes ##oth slaughter hague krakow lawson tailed elias ginger ##ttes canopy betrayal rebuilding turf ##hof frowning allegiance brigades kicks rebuild polls alias nationalism td rowan audition bowie fortunately recognizes harp dillon horrified ##oro renault ##tics ropes ##α presumed rewarded infrared wiping accelerated illustration ##rid presses practitioners badminton ##iard detained ##tera recognizing relates misery ##sies ##tly reproduction piercing potatoes thornton esther manners hbo ##aan ours bullshit ernie perennial sensitivity illuminated rupert ##jin ##iss ##ear rfc nassau ##dock staggered socialism ##haven appointments nonsense prestige sharma haul ##tical solidarity gps ##ook ##rata igor pedestrian ##uit baxter tenants wires medication unlimited guiding impacts diabetes ##rama sasha pas clive extraction 131 continually constraints ##bilities sonata hunted sixteenth chu planting quote mayer pretended abs spat ##hua ceramic ##cci curtains pigs pitching ##dad latvian sore dayton ##sted ##qi patrols slice playground ##nted shone stool apparatus inadequate mates treason ##ija desires ##liga ##croft somalia laurent mir leonardo oracle grape obliged chevrolet thirteenth stunning enthusiastic ##ede accounted concludes currents basil ##kovic drought ##rica mai ##aire shove posting ##shed pilgrimage humorous packing fry pencil wines smells 144 marilyn aching newest clung bon neighbours sanctioned ##pie mug ##stock drowning ##mma hydraulic ##vil hiring reminder lilly investigators ##ncies sour ##eous compulsory packet ##rion ##graphic ##elle cannes ##inate depressed ##rit heroic importantly theresa ##tled conway saturn marginal rae ##xia corresponds royce pact jasper explosives packaging aluminium ##ttered denotes rhythmic spans assignments hereditary outlined originating sundays lad reissued greeting beatrice ##dic pillar marcos plots handbook alcoholic judiciary avant slides extract masculine blur ##eum ##force homage trembled owens hymn trey omega signaling socks accumulated reacted attic theo lining angie distraction primera talbot ##key 1200 ti creativity billed ##hey deacon eduardo identifies proposition dizzy gunner hogan ##yam ##pping ##hol ja ##chan jensen reconstructed ##berger clearance darius ##nier abe harlem plea dei circled emotionally notation fascist neville exceeded upwards viable ducks ##fo workforce racer limiting shri ##lson possesses 1600 kerr moths devastating laden disturbing locking ##cture gal fearing accreditation flavor aide 1870s mountainous ##baum melt ##ures motel texture servers soda ##mb herd ##nium erect puzzled hum peggy examinations gould testified geoff ren devised sacks ##law denial posters grunted cesar tutor ec gerry offerings byrne falcons combinations ct incoming pardon rocking 26th avengers flared mankind seller uttar loch nadia stroking exposing ##hd fertile ancestral instituted ##has noises prophecy taxation eminent vivid pol ##bol dart indirect 
multimedia notebook upside displaying adrenaline referenced geometric ##iving progression ##ddy blunt announce ##far implementing ##lav aggression liaison cooler cares headache plantations gorge dots impulse thickness ashamed averaging kathy obligation precursor 137 fowler symmetry thee 225 hears ##rai undergoing ads butcher bowler ##lip cigarettes subscription goodness ##ically browne ##hos ##tech kyoto donor ##erty damaging friction drifting expeditions hardened prostitution 152 fauna blankets claw tossing snarled butterflies recruits investigative coated healed 138 communal hai xiii academics boone psychologist restless lahore stephens mba brendan foreigners printer ##pc ached explode 27th deed scratched dared ##pole cardiac 1780 okinawa proto commando compelled oddly electrons ##base replica thanksgiving ##rist sheila deliberate stafford tidal representations hercules ou ##path ##iated kidnapping lenses ##tling deficit samoa mouths consuming computational maze granting smirk razor fixture ideals inviting aiden nominal ##vs issuing julio pitt ramsey docks ##oss exhaust ##owed bavarian draped anterior mating ethiopian explores noticing ##nton discarded convenience hoffman endowment beasts cartridge mormon paternal probe sleeves interfere lump deadline ##rail jenks bulldogs scrap alternating justified reproductive nam seize descending secretariat kirby coupe grouped smash panther sedan tapping ##18 lola cheer germanic unfortunate ##eter unrelated ##fan subordinate ##sdale suzanne advertisement ##ility horsepower ##lda cautiously discourse luigi ##mans ##fields noun prevalent mao schneider everett surround governorate kira ##avia westward ##take misty rails sustainability 134 unused ##rating packs toast unwilling regulate thy suffrage nile awe assam definitions travelers affordable ##rb conferred sells undefeated beneficial torso basal repeating remixes ##pass bahrain cables fang ##itated excavated numbering statutory ##rey deluxe ##lian forested ramirez derbyshire zeus slamming transfers astronomer banana lottery berg histories bamboo ##uchi resurrection posterior bowls vaguely ##thi thou preserving tensed offence ##inas meyrick callum ridden watt langdon tying lowland snorted daring truman ##hale ##girl aura overly filing weighing goa infections philanthropist saunders eponymous ##owski latitude perspectives reviewing mets commandant radial ##kha flashlight reliability koch vowels amazed ada elaine supper ##rth ##encies predator debated soviets cola ##boards ##nah compartment crooked arbitrary fourteenth ##ctive havana majors steelers clips profitable ambush exited packers ##tile nude cracks fungi ##е limb trousers josie shelby tens frederic ##ος definite smoothly constellation insult baton discs lingering ##nco conclusions lent staging becker grandpa shaky ##tron einstein obstacles sk adverse elle economically ##moto mccartney thor dismissal motions readings nostrils treatise ##pace squeezing evidently prolonged 1783 venezuelan je marguerite beirut takeover shareholders ##vent denise digit airplay norse ##bbling imaginary pills hubert blaze vacated eliminating ##ello vine mansfield ##tty retrospective barrow borne clutch bail forensic weaving ##nett ##witz desktop citadel promotions worrying dorset ieee subdivided ##iating manned expeditionary pickup synod chuckle 185 barney ##rz ##ffin functionality karachi litigation meanings uc lick turbo anders ##ffed execute curl oppose ankles typhoon ##د ##ache ##asia linguistics compassion pressures grazing perfection ##iting immunity monopoly 
muddy backgrounds 136 namibia francesca monitors attracting stunt tuition ##ии vegetable ##mates ##quent mgm jen complexes forts ##ond cellar bites seventeenth royals flemish failures mast charities ##cular peruvian capitals macmillan ipswich outward frigate postgraduate folds employing ##ouse concurrently fiery ##tai contingent nightmares monumental nicaragua ##kowski lizard mal fielding gig reject ##pad harding ##ipe coastline ##cin ##nos beethoven humphrey innovations ##tam ##nge norris doris solicitor huang obey 141 ##lc niagara ##tton shelves aug bourbon curry nightclub specifications hilton ##ndo centennial dispersed worm neglected briggs sm font kuala uneasy plc ##nstein ##bound ##aking ##burgh awaiting pronunciation ##bbed ##quest eh optimal zhu raped greens presided brenda worries ##life venetian marxist turnout ##lius refined braced sins grasped sunderland nickel speculated lowell cyrillic communism fundraising resembling colonists mutant freddie usc ##mos gratitude ##run mural ##lous chemist wi reminds 28th steals tess pietro ##ingen promoter ri microphone honoured rai sant ##qui feather ##nson burlington kurdish terrorists deborah sickness ##wed ##eet hazard irritated desperation veil clarity ##rik jewels xv ##gged ##ows ##cup berkshire unfair mysteries orchid winced exhaustion renovations stranded obe infinity ##nies adapt redevelopment thanked registry olga domingo noir tudor ole ##atus commenting behaviors ##ais crisp pauline probable stirling wigan ##bian paralympics panting surpassed ##rew luca barred pony famed ##sters cassandra waiter carolyn exported ##orted andres destructive deeds jonah castles vacancy suv ##glass 1788 orchard yep famine belarusian sprang ##forth skinny ##mis administrators rotterdam zambia zhao boiler discoveries ##ride ##physics lucius disappointing outreach spoon ##frame qualifications unanimously enjoys regency ##iidae stade realism veterinary rodgers dump alain chestnut castile censorship rumble gibbs ##itor communion reggae inactivated logs loads ##houses homosexual ##iano ale informs ##cas phrases plaster linebacker ambrose kaiser fascinated 850 limerick recruitment forge mastered ##nding leinster rooted threaten ##strom borneo ##hes suggestions scholarships propeller documentaries patronage coats constructing invest neurons comet entirety shouts identities annoying unchanged wary ##antly ##ogy neat oversight ##kos phillies replay constance ##kka incarnation humble skies minus ##acy smithsonian ##chel guerrilla jar cadets ##plate surplus audit ##aru cracking joanna louisa pacing ##lights intentionally ##iri diner nwa imprint australians tong unprecedented bunker naive specialists ark nichols railing leaked pedal ##uka shrub longing roofs v8 captains neural tuned ##ntal ##jet emission medina frantic codex definitive sid abolition intensified stocks enrique sustain genoa oxide ##written clues cha ##gers tributaries fragment venom ##rity ##ente ##sca muffled vain sire laos ##ingly ##hana hastily snapping surfaced sentiment motive ##oft contests approximate mesa luckily dinosaur exchanges propelled accord bourne relieve tow masks offended ##ues cynthia ##mmer rains bartender zinc reviewers lois ##sai legged arrogant rafe rosie comprise handicap blockade inlet lagoon copied drilling shelley petals ##inian mandarin obsolete ##inated onward arguably productivity cindy praising seldom busch discusses raleigh shortage ranged stanton encouragement firstly conceded overs temporal ##uke cbe ##bos woo certainty pumps ##pton stalked ##uli lizzie periodic 
thieves weaker ##night gases shoving chooses wc ##chemical prompting weights ##kill robust flanked sticky hu tuberculosis ##eb ##eal christchurch resembled wallet reese inappropriate pictured distract fixing fiddle giggled burger heirs hairy mechanic torque apache obsessed chiefly cheng logging ##tag extracted meaningful numb ##vsky gloucestershire reminding ##bay unite ##lit breeds diminished clown glove 1860s ##ن ##ug archibald focal freelance sliced depiction ##yk organism switches sights stray crawling ##ril lever leningrad interpretations loops anytime reel alicia delighted ##ech inhaled xiv suitcase bernie vega licenses northampton exclusion induction monasteries racecourse homosexuality ##right ##sfield ##rky dimitri michele alternatives ions commentators genuinely objected pork hospitality fencing stephan warships peripheral wit drunken wrinkled quentin spends departing chung numerical spokesperson ##zone johannesburg caliber killers ##udge assumes neatly demographic abigail bloc ##vel mounting ##lain bentley slightest xu recipients ##jk merlin ##writer seniors prisons blinking hindwings flickered kappa ##hel 80s strengthening appealing brewing gypsy mali lashes hulk unpleasant harassment bio treaties predict instrumentation pulp troupe boiling mantle ##ffe ins ##vn dividing handles verbs ##onal coconut senegal 340 thorough gum momentarily ##sto cocaine panicked destined ##turing teatro denying weary captained mans ##hawks ##code wakefield bollywood thankfully ##16 cyril ##wu amendments ##bahn consultation stud reflections kindness 1787 internally ##ovo tex mosaic distribute paddy seeming 143 ##hic piers ##15 ##mura ##verse popularly winger kang sentinel mccoy ##anza covenant ##bag verge fireworks suppress thrilled dominate ##jar swansea ##60 142 reconciliation ##ndi stiffened cue dorian ##uf damascus amor ida foremost ##aga porsche unseen dir ##had ##azi stony lexi melodies ##nko angular integer podcast ants inherent jaws justify persona ##olved josephine ##nr ##ressed customary flashes gala cyrus glaring backyard ariel physiology greenland html stir avon atletico finch methodology ked ##lent mas catholicism townsend branding quincy fits containers 1777 ashore aragon ##19 forearm poisoning ##sd adopting conquer grinding amnesty keller finances evaluate forged lankan instincts ##uto guam bosnian photographed workplace desirable protector ##dog allocation intently encourages willy ##sten bodyguard electro brighter ##ν bihar ##chev lasts opener amphibious sal verde arte ##cope captivity vocabulary yields ##tted agreeing desmond pioneered ##chus strap campaigned railroads ##ович emblem ##dre stormed 501 ##ulous marijuana northumberland ##gn ##nath bowen landmarks beaumont ##qua danube ##bler attorneys th ge flyers critique villains cass mutation acc ##0s colombo mckay motif sampling concluding syndicate ##rell neon stables ds warnings clint mourning wilkinson ##tated merrill leopard evenings exhaled emil sonia ezra discrete stove farrell fifteenth prescribed superhero ##rier worms helm wren ##duction ##hc expo ##rator hq unfamiliar antony prevents acceleration fiercely mari painfully calculations cheaper ign clifton irvine davenport mozambique ##np pierced ##evich wonders ##wig ##cate ##iling crusade ware ##uel enzymes reasonably mls ##coe mater ambition bunny eliot kernel ##fin asphalt headmaster torah aden lush pins waived ##care ##yas joao substrate enforce ##grad ##ules alvarez selections epidemic tempted ##bit bremen translates ensured waterfront 29th forrest manny malone kramer 
reigning cookies simpler absorption 205 engraved ##ffy evaluated 1778 haze 146 comforting crossover ##abe thorn ##rift ##imo ##pop suppression fatigue cutter ##tr 201 wurttemberg ##orf enforced hovering proprietary gb samurai syllable ascent lacey tick lars tractor merchandise rep bouncing defendants ##yre huntington ##ground ##oko standardized ##hor ##hima assassinated nu predecessors rainy liar assurance lyrical ##uga secondly flattened ios parameter undercover ##mity bordeaux punish ridges markers exodus inactive hesitate debbie nyc pledge savoy nagar offset organist ##tium hesse marin converting ##iver diagram propulsion pu validity reverted supportive ##dc ministries clans responds proclamation ##inae ##ø ##rea ein pleading patriot sf birch islanders strauss hates ##dh brandenburg concession rd ##ob 1900s killings textbook antiquity cinematography wharf embarrassing setup creed farmland inequality centred signatures fallon 370 ##ingham ##uts ceylon gazing directive laurie ##tern globally ##uated ##dent allah excavation threads ##cross 148 frantically icc utilize determines respiratory thoughtful receptions ##dicate merging chandra seine 147 builders builds diagnostic dev visibility goddamn analyses dhaka cho proves chancel concurrent curiously canadians pumped restoring 1850s turtles jaguar sinister spinal traction declan vows 1784 glowed capitalism swirling install universidad ##lder ##oat soloist ##genic ##oor coincidence beginnings nissan dip resorts caucasus combustion infectious ##eno pigeon serpent ##itating conclude masked salad jew ##gr surreal toni ##wc harmonica 151 ##gins ##etic ##coat fishermen intending bravery ##wave klaus titan wembley taiwanese ransom 40th incorrect hussein eyelids jp cooke dramas utilities ##etta ##print eisenhower principally granada lana ##rak openings concord ##bl bethany connie morality sega ##mons ##nard earnings ##kara ##cine wii communes ##rel coma composing softened severed grapes ##17 nguyen analyzed warlord hubbard heavenly behave slovenian ##hit ##ony hailed filmmakers trance caldwell skye unrest coward likelihood ##aging bern sci taliban honolulu propose ##wang 1700 browser imagining cobra contributes dukes instinctively conan violinist ##ores accessories gradual ##amp quotes sioux ##dating undertake intercepted sparkling compressed 139 fungus tombs haley imposing rests degradation lincolnshire retailers wetlands tulsa distributor dungeon nun greenhouse convey atlantis aft exits oman dresser lyons ##sti joking eddy judgement omitted digits ##cts ##game juniors ##rae cents stricken une ##ngo wizards weir breton nan technician fibers liking royalty ##cca 154 persia terribly magician ##rable ##unt vance cafeteria booker camille warmer ##static consume cavern gaps compass contemporaries foyer soothing graveyard maj plunged blush ##wear cascade demonstrates ordinance ##nov boyle ##lana rockefeller shaken banjo izzy ##ense breathless vines ##32 ##eman alterations chromosome dwellings feudal mole 153 catalonia relics tenant mandated ##fm fridge hats honesty patented raul heap cruisers accusing enlightenment infants wherein chatham contractors zen affinity hc osborne piston 156 traps maturity ##rana lagos ##zal peering ##nay attendant dealers protocols subset prospects biographical ##cre artery ##zers insignia nuns endured ##eration recommend schwartz serbs berger cromwell crossroads ##ctor enduring clasped grounded ##bine marseille twitched abel choke https catalyst moldova italians ##tist disastrous wee ##oured ##nti wwf nope ##piration ##asa 
expresses thumbs 167 ##nza coca 1781 cheating ##ption skipped sensory heidelberg spies satan dangers semifinal 202 bohemia whitish confusing shipbuilding relies surgeons landings ravi baku moor suffix alejandro ##yana litre upheld ##unk rajasthan ##rek coaster insists posture scenarios etienne favoured appoint transgender elephants poked greenwood defences fulfilled militant somali 1758 chalk potent ##ucci migrants wink assistants nos restriction activism niger ##ario colon shaun ##sat daphne ##erated swam congregations reprise considerations magnet playable xvi ##р overthrow tobias knob chavez coding ##mers propped katrina orient newcomer ##suke temperate ##pool farmhouse interrogation ##vd committing ##vert forthcoming strawberry joaquin macau ponds shocking siberia ##cellular chant contributors ##nant ##ologists sped absorb hail 1782 spared ##hore barbados karate opus originates saul ##xie evergreen leaped ##rock correlation exaggerated weekday unification bump tracing brig afb pathways utilizing ##ners mod mb disturbance kneeling ##stad ##guchi 100th pune ##thy decreasing 168 manipulation miriam academia ecosystem occupational rbi ##lem rift ##14 rotary stacked incorporation awakening generators guerrero racist ##omy cyber derivatives culminated allie annals panzer sainte wikipedia pops zu austro ##vate algerian politely nicholson mornings educate tastes thrill dartmouth ##gating db ##jee regan differing concentrating choreography divinity ##media pledged alexandre routing gregor madeline ##idal apocalypse ##hora gunfire culminating elves fined liang lam programmed tar guessing transparency gabrielle ##gna cancellation flexibility ##lining accession shea stronghold nets specializes ##rgan abused hasan sgt ling exceeding ##₄ admiration supermarket ##ark photographers specialised tilt resonance hmm perfume 380 sami threatens garland botany guarding boiled greet puppy russo supplier wilmington vibrant vijay ##bius paralympic grumbled paige faa licking margins hurricanes ##gong fest grenade ripping ##uz counseling weigh ##sian needles wiltshire edison costly ##not fulton tramway redesigned staffordshire cache gasping watkins sleepy candidacy ##group monkeys timeline throbbing ##bid ##sos berth uzbekistan vanderbilt bothering overturned ballots gem ##iger sunglasses subscribers hooker compelling ang exceptionally saloon stab ##rdi carla terrifying rom ##vision coil ##oids satisfying vendors 31st mackay deities overlooked ambient bahamas felipe olympia whirled botanist advertised tugging ##dden disciples morales unionist rites foley morse motives creepy ##₀ soo ##sz bargain highness frightening turnpike tory reorganization ##cer depict biographer ##walk unopposed manifesto ##gles institut emile accidental kapoor ##dam kilkenny cortex lively ##13 romanesque jain shan cannons ##ood ##ske petrol echoing amalgamated disappears cautious proposes sanctions trenton ##ر flotilla aus contempt tor canary cote theirs ##hun conceptual deleted fascinating paso blazing elf honourable hutchinson ##eiro ##outh ##zin surveyor tee amidst wooded reissue intro ##ono cobb shelters newsletter hanson brace encoding confiscated dem caravan marino scroll melodic cows imam ##adi ##aneous northward searches biodiversity cora 310 roaring ##bers connell theologian halo compose pathetic unmarried dynamo ##oot az calculation toulouse deserves humour nr forgiveness tam undergone martyr pamela myths whore counselor hicks 290 heavens battleship electromagnetic ##bbs stellar establishments presley hopped ##chin temptation 90s 
wills nas ##yuan nhs ##nya seminars ##yev adaptations gong asher lex indicator sikh tobago cites goin ##yte satirical ##gies characterised correspond bubbles lure participates ##vid eruption skate therapeutic 1785 canals wholesale defaulted sac 460 petit ##zzled virgil leak ravens 256 portraying ##yx ghetto creators dams portray vicente ##rington fae namesake bounty ##arium joachim ##ota ##iser aforementioned axle snout depended dismantled reuben 480 ##ibly gallagher ##lau ##pd earnest ##ieu ##iary inflicted objections ##llar asa gritted ##athy jericho ##sea ##was flick underside ceramics undead substituted 195 eastward undoubtedly wheeled chimney ##iche guinness cb ##ager siding ##bell traitor baptiste disguised inauguration 149 tipperary choreographer perched warmed stationary eco ##ike ##ntes bacterial ##aurus flores phosphate ##core attacker invaders alvin intersects a1 indirectly immigrated businessmen cornelius valves narrated pill sober ul nationale monastic applicants scenery ##jack 161 motifs constitutes cpu ##osh jurisdictions sd tuning irritation woven ##uddin fertility gao ##erie antagonist impatient glacial hides boarded denominations interception ##jas cookie nicola ##tee algebraic marquess bahn parole buyers bait turbines paperwork bestowed natasha renee oceans purchases 157 vaccine 215 ##tock fixtures playhouse integrate jai oswald intellectuals ##cky booked nests mortimer ##isi obsession sept ##gler ##sum 440 scrutiny simultaneous squinted ##shin collects oven shankar penned remarkably ##я slips luggage spectral 1786 collaborations louie consolidation ##ailed ##ivating 420 hoover blackpool harness ignition vest tails belmont mongol skinner ##nae visually mage derry ##tism ##unce stevie transitional ##rdy redskins drying prep prospective ##21 annoyance oversee ##loaded fills ##books ##iki announces fda scowled respects prasad mystic tucson ##vale revue springer bankrupt 1772 aristotle salvatore habsburg ##geny dal natal nut pod chewing darts moroccan walkover rosario lenin punjabi ##ße grossed scattering wired invasive hui polynomial corridors wakes gina portrays ##cratic arid retreating erich irwin sniper ##dha linen lindsey maneuver butch shutting socio bounce commemorative postseason jeremiah pines 275 mystical beads bp abbas furnace bidding consulted assaulted empirical rubble enclosure sob weakly cancel polly yielded ##emann curly prediction battered 70s vhs jacqueline render sails barked detailing grayson riga sloane raging ##yah herbs bravo ##athlon alloy giggle imminent suffers assumptions waltz ##itate accomplishments ##ited bathing remixed deception prefix ##emia deepest ##tier ##eis balkan frogs ##rong slab ##pate philosophers peterborough grains imports dickinson rwanda ##atics 1774 dirk lan tablets ##rove clone ##rice caretaker hostilities mclean ##gre regimental treasures norms impose tsar tango diplomacy variously complain 192 recognise arrests 1779 celestial pulitzer ##dus bing libretto ##moor adele splash ##rite expectation lds confronts ##izer spontaneous harmful wedge entrepreneurs buyer ##ope bilingual translate rugged conner circulated uae eaton ##gra ##zzle lingered lockheed vishnu reelection alonso ##oom joints yankee headline cooperate heinz laureate invading ##sford echoes scandinavian ##dham hugging vitamin salute micah hind trader ##sper radioactive ##ndra militants poisoned ratified remark campeonato deprived wander prop ##dong outlook ##tani ##rix ##eye chiang darcy ##oping mandolin spice statesman babylon 182 walled forgetting afro ##cap 158 
iss staunch ##onga astronomers sera sofie emergencies susquehanna ##heard duc mastery vh1 williamsburg bayer buckled craving ##khan ##rdes bloomington ##write alton barbecue ##bians justine ##hri ##ndt delightful smartphone newtown photon retrieval peugeot hissing ##monium ##orough flavors lighted relaunched tainted ##games ##lysis anarchy microscopic hopping adept evade evie ##beau inhibit sinn adjustable hurst intuition wilton cisco 44th lawful lowlands stockings thierry ##dalen ##hila ##nai fates prank tb maison lobbied provocative 1724 4a utopia ##qual carbonate gujarati purcell ##rford curtiss ##mei overgrown arenas mediation swallows ##rnik respectful turnbull ##hedron ##hope alyssa ozone ##ʻi ami gestapo johansson snooker canteen cuff declines empathy stigma ##ags ##iner ##raine taxpayers gui volga ##wright ##copic lifespan overcame tattooed enactment giggles ##ador ##camp barrington bribe obligatory orbiting peng ##enas elusive sucker ##vating cong hardship empowered anticipating estrada cryptic greasy detainees planck sudbury plaid dod marriott kayla ##ears ##vb ##zd mortally ##hein cognition radha 319 liechtenstein meade richly argyle harpsichord liberalism trumpets lauded tyrant salsa tiled lear promoters reused slicing trident ##chuk ##gami ##lka cantor checkpoint ##points gaul leger mammalian ##tov ##aar ##schaft doha frenchman nirvana ##vino delgado headlining ##eron ##iography jug tko 1649 naga intersections ##jia benfica nawab ##suka ashford gulp ##deck ##vill ##rug brentford frazier pleasures dunne potsdam shenzhen dentistry ##tec flanagan ##dorff ##hear chorale dinah prem quezon ##rogated relinquished sutra terri ##pani flaps ##rissa poly ##rnet homme aback ##eki linger womb ##kson ##lewood doorstep orthodoxy threaded westfield ##rval dioceses fridays subsided ##gata loyalists ##biotic ##ettes letterman lunatic prelate tenderly invariably souza thug winslow ##otide furlongs gogh jeopardy ##runa pegasus ##umble humiliated standalone tagged ##roller freshmen klan ##bright attaining initiating transatlantic logged viz ##uance 1723 combatants intervening stephane chieftain despised grazed 317 cdc galveston godzilla macro simulate ##planes parades ##esses 960 ##ductive ##unes equator overdose ##cans ##hosh ##lifting joshi epstein sonora treacherous aquatics manchu responsive ##sation supervisory ##christ ##llins ##ibar ##balance ##uso kimball karlsruhe mab ##emy ignores phonetic reuters spaghetti 820 almighty danzig rumbling tombstone designations lured outset ##felt supermarkets ##wt grupo kei kraft susanna ##blood comprehension genealogy ##aghan ##verted redding ##ythe 1722 bowing ##pore ##roi lest sharpened fulbright valkyrie sikhs ##unds swans bouquet merritt ##tage ##venting commuted redhead clerks leasing cesare dea hazy ##vances fledged greenfield servicemen ##gical armando blackout dt sagged downloadable intra potion pods ##4th ##mism xp attendants gambia stale ##ntine plump asteroids rediscovered buds flea hive ##neas 1737 classifications debuts ##eles olympus scala ##eurs ##gno ##mute hummed sigismund visuals wiggled await pilasters clench sulfate ##ances bellevue enigma trainee snort ##sw clouded denim ##rank ##rder churning hartman lodges riches sima ##missible accountable socrates regulates mueller ##cr 1702 avoids solids himalayas nutrient pup ##jevic squat fades nec ##lates ##pina ##rona ##ου privateer tequila ##gative ##mpton apt hornet immortals ##dou asturias cleansing dario ##rries ##anta etymology servicing zhejiang ##venor ##nx horned erasmus rayon 
relocating £10 ##bags escalated promenade stubble 2010s artisans axial liquids mora sho yoo ##tsky bundles oldies ##nally notification bastion ##ths sparkle ##lved 1728 leash pathogen highs ##hmi immature 880 gonzaga ignatius mansions monterrey sweets bryson ##loe polled regatta brightest pei rosy squid hatfield payroll addict meath cornerback heaviest lodging ##mage capcom rippled ##sily barnet mayhem ymca snuggled rousseau ##cute blanchard 284 fragmented leighton chromosomes risking ##md ##strel ##utter corinne coyotes cynical hiroshi yeomanry ##ractive ebook grading mandela plume agustin magdalene ##rkin bea femme trafford ##coll ##lun ##tance 52nd fourier upton ##mental camilla gust iihf islamabad longevity ##kala feldman netting ##rization endeavour foraging mfa orr ##open greyish contradiction graz ##ruff handicapped marlene tweed oaxaca spp campos miocene pri configured cooks pluto cozy pornographic ##entes 70th fairness glided jonny lynne rounding sired ##emon ##nist remade uncover ##mack complied lei newsweek ##jured ##parts ##enting ##pg 293 finer guerrillas athenian deng disused stepmother accuse gingerly seduction 521 confronting ##walker ##going gora nostalgia sabres virginity wrenched ##minated syndication wielding eyre ##56 ##gnon ##igny behaved taxpayer sweeps ##growth childless gallant ##ywood amplified geraldine scrape ##ffi babylonian fresco ##rdan ##kney ##position 1718 restricting tack fukuoka osborn selector partnering ##dlow 318 gnu kia tak whitley gables ##54 ##mania mri softness immersion ##bots ##evsky 1713 chilling insignificant pcs ##uis elites lina purported supplemental teaming ##americana ##dding ##inton proficient rouen ##nage ##rret niccolo selects ##bread fluffy 1621 gruff knotted mukherjee polgara thrash nicholls secluded smoothing thru corsica loaf whitaker inquiries ##rrier ##kam indochina 289 marlins myles peking ##tea extracts pastry superhuman connacht vogel ##ditional ##het ##udged ##lash gloss quarries refit teaser ##alic ##gaon 20s materialized sling camped pickering tung tracker pursuant ##cide cranes soc ##cini ##typical ##viere anhalt overboard workout chores fares orphaned stains ##logie fenton surpassing joyah triggers ##itte grandmaster ##lass ##lists clapping fraudulent ledger nagasaki ##cor ##nosis ##tsa eucalyptus tun ##icio ##rney ##tara dax heroism ina wrexham onboard unsigned ##dates moshe galley winnie droplets exiles praises watered noodles ##aia fein adi leland multicultural stink bingo comets erskine modernized canned constraint domestically chemotherapy featherweight stifled ##mum darkly irresistible refreshing hasty isolate ##oys kitchener planners ##wehr cages yarn implant toulon elects childbirth yue ##lind ##lone cn rightful sportsman junctions remodeled specifies ##rgh 291 ##oons complimented ##urgent lister ot ##logic bequeathed cheekbones fontana gabby ##dial amadeus corrugated maverick resented triangles ##hered ##usly nazareth tyrol 1675 assent poorer sectional aegean ##cous 296 nylon ghanaian ##egorical ##weig cushions forbid fusiliers obstruction somerville ##scia dime earrings elliptical leyte oder polymers timmy atm midtown piloted settles continual externally mayfield ##uh enrichment henson keane persians 1733 benji braden pep 324 ##efe contenders pepsi valet ##isches 298 ##asse ##earing goofy stroll ##amen authoritarian occurrences adversary ahmedabad tangent toppled dorchester 1672 modernism marxism islamist charlemagne exponential racks unicode brunette mbc pic skirmish ##bund ##lad ##powered ##yst hoisted messina 
shatter ##ctum jedi vantage ##music ##neil clemens mahmoud corrupted authentication lowry nils ##washed omnibus wounding jillian ##itors ##opped serialized narcotics handheld ##arm ##plicity intersecting stimulating ##onis crate fellowships hemingway casinos climatic fordham copeland drip beatty leaflets robber brothel madeira ##hedral sphinx ultrasound ##vana valor forbade leonid villas ##aldo duane marquez ##cytes disadvantaged forearms kawasaki reacts consular lax uncles uphold ##hopper concepcion dorsey lass ##izan arching passageway 1708 researches tia internationals ##graphs ##opers distinguishes javanese divert ##uven plotted ##listic ##rwin ##erik ##tify affirmative signifies validation ##bson kari felicity georgina zulu ##eros ##rained ##rath overcoming ##dot argyll ##rbin 1734 chiba ratification windy earls parapet ##marks hunan pristine astrid punta ##gart brodie ##kota ##oder malaga minerva rouse ##phonic bellowed pagoda portals reclamation ##gur ##odies ##⁄₄ parentheses quoting allergic palette showcases benefactor heartland nonlinear ##tness bladed cheerfully scans ##ety ##hone 1666 girlfriends pedersen hiram sous ##liche ##nator 1683 ##nery ##orio ##umen bobo primaries smiley ##cb unearthed uniformly fis metadata 1635 ind ##oted recoil ##titles ##tura ##ια 406 hilbert jamestown mcmillan tulane seychelles ##frid antics coli fated stucco ##grants 1654 bulky accolades arrays caledonian carnage optimism puebla ##tative ##cave enforcing rotherham seo dunlop aeronautics chimed incline zoning archduke hellenistic ##oses ##sions candi thong ##ople magnate rustic ##rsk projective slant ##offs danes hollis vocalists ##ammed congenital contend gesellschaft ##ocating ##pressive douglass quieter ##cm ##kshi howled salim spontaneously townsville buena southport ##bold kato 1638 faerie stiffly ##vus ##rled 297 flawless realising taboo ##7th bytes straightening 356 jena ##hid ##rmin cartwright berber bertram soloists 411 noses 417 coping fission hardin inca ##cen 1717 mobilized vhf ##raf biscuits curate ##85 ##anial 331 gaunt neighbourhoods 1540 ##abas blanca bypassed sockets behold coincidentally ##bane nara shave splinter terrific ##arion ##erian commonplace juris redwood waistband boxed caitlin fingerprints jennie naturalized ##ired balfour craters jody bungalow hugely quilt glitter pigeons undertaker bulging constrained goo ##sil ##akh assimilation reworked ##person persuasion ##pants felicia ##cliff ##ulent 1732 explodes ##dun ##inium ##zic lyman vulture hog overlook begs northwards ow spoil ##urer fatima favorably accumulate sargent sorority corresponded dispersal kochi toned ##imi ##lita internacional newfound ##agger ##lynn ##rigue booths peanuts ##eborg medicare muriel nur ##uram crates millennia pajamas worsened ##breakers jimi vanuatu yawned ##udeau carousel ##hony hurdle ##ccus ##mounted ##pod rv ##eche airship ambiguity compulsion recapture ##claiming arthritis ##osomal 1667 asserting ngc sniffing dade discontent glendale ported ##amina defamation rammed ##scent fling livingstone ##fleet 875 ##ppy apocalyptic comrade lcd ##lowe cessna eine persecuted subsistence demi hoop reliefs 710 coptic progressing stemmed perpetrators 1665 priestess ##nio dobson ebony rooster itf tortricidae ##bbon ##jian cleanup ##jean ##øy 1721 eighties taxonomic holiness ##hearted ##spar antilles showcasing stabilized ##nb gia mascara michelangelo dawned ##uria ##vinsky extinguished fitz grotesque £100 ##fera ##loid ##mous barges neue throbbed cipher johnnie ##a1 ##mpt outburst ##swick spearheaded 
administrations c1 heartbreak pixels pleasantly ##enay lombardy plush ##nsed bobbie ##hly reapers tremor xiang minogue substantive hitch barak ##wyl kwan ##encia 910 obscene elegance indus surfer bribery conserve ##hyllum ##masters horatio ##fat apes rebound psychotic ##pour iteration ##mium ##vani botanic horribly antiques dispose paxton ##hli ##wg timeless 1704 disregard engraver hounds ##bau ##version looted uno facilitates groans masjid rutland antibody disqualification decatur footballers quake slacks 48th rein scribe stabilize commits exemplary tho ##hort ##chison pantry traversed ##hiti disrepair identifiable vibrated baccalaureate ##nnis csa interviewing ##iensis ##raße greaves wealthiest 343 classed jogged £5 ##58 ##atal illuminating knicks respecting ##uno scrubbed ##iji ##dles kruger moods growls raider silvia chefs kam vr cree percival ##terol gunter counterattack defiant henan ze ##rasia ##riety equivalence submissions ##fra ##thor bautista mechanically ##heater cornice herbal templar ##mering outputs ruining ligand renumbered extravagant mika blockbuster eta insurrection ##ilia darkening ferocious pianos strife kinship ##aer melee ##anor ##iste ##may ##oue decidedly weep ##jad ##missive ##ppel 354 puget unease ##gnant 1629 hammering kassel ob wessex ##lga bromwich egan paranoia utilization ##atable ##idad contradictory provoke ##ols ##ouring ##tangled knesset ##very ##lette plumbing ##sden ##¹ greensboro occult sniff 338 zev beaming gamer haggard mahal ##olt ##pins mendes utmost briefing gunnery ##gut ##pher ##zh ##rok 1679 khalifa sonya ##boot principals urbana wiring ##liffe ##minating ##rrado dahl nyu skepticism np townspeople ithaca lobster somethin ##fur ##arina ##−1 freighter zimmerman biceps contractual ##herton amend hurrying subconscious ##anal 336 meng clermont spawning ##eia ##lub dignitaries impetus snacks spotting twigs ##bilis ##cz ##ouk libertadores nic skylar ##aina ##firm gustave asean ##anum dieter legislatures flirt bromley trolls umar ##bbies ##tyle blah parc bridgeport crank negligence ##nction 46th constantin molded bandages seriousness 00pm siegel carpets compartments upbeat statehood ##dner ##edging marko 730 platt ##hane paving ##iy 1738 abbess impatience limousine nbl ##talk 441 lucille mojo nightfall robbers ##nais karel brisk calves replicate ascribed telescopes ##olf intimidated ##reen ballast specialization ##sit aerodynamic caliphate rainer visionary ##arded epsilon ##aday ##onte aggregation auditory boosted reunification kathmandu loco robyn 402 acknowledges appointing humanoid newell redeveloped restraints ##tained barbarians chopper 1609 italiana ##lez ##lho investigates wrestlemania ##anies ##bib 690 ##falls creaked dragoons gravely minions stupidity volley ##harat ##week musik ##eries ##uously fungal massimo semantics malvern ##ahl ##pee discourage embryo imperialism 1910s profoundly ##ddled jiangsu sparkled stat ##holz sweatshirt tobin ##iction sneered ##cheon ##oit brit causal smyth ##neuve diffuse perrin silvio ##ipes ##recht detonated iqbal selma ##nism ##zumi roasted ##riders tay ##ados ##mament ##mut ##rud 840 completes nipples cfa flavour hirsch ##laus calderon sneakers moravian ##ksha 1622 rq 294 ##imeters bodo ##isance ##pre ##ronia anatomical excerpt ##lke dh kunst ##tablished ##scoe biomass panted unharmed gael housemates montpellier ##59 coa rodents tonic hickory singleton ##taro 451 1719 aldo breaststroke dempsey och rocco ##cuit merton dissemination midsummer serials ##idi haji polynomials ##rdon gs enoch prematurely shutter 
taunton £3 ##grating ##inates archangel harassed ##asco 326 archway dazzling ##ecin 1736 sumo wat ##kovich 1086 honneur ##ently ##nostic ##ttal ##idon 1605 403 1716 blogger rents ##gnan hires ##ikh ##dant howie ##rons handler retracted shocks 1632 arun duluth kepler trumpeter ##lary peeking seasoned trooper ##mara laszlo ##iciencies ##rti heterosexual ##inatory ##ssion indira jogging ##inga ##lism beit dissatisfaction malice ##ately nedra peeling ##rgeon 47th stadiums 475 vertigo ##ains iced restroom ##plify ##tub illustrating pear ##chner ##sibility inorganic rappers receipts watery ##kura lucinda ##oulos reintroduced ##8th ##tched gracefully saxons nutritional wastewater rained favourites bedrock fisted hallways likeness upscale ##lateral 1580 blinds prequel ##pps ##tama deter humiliating restraining tn vents 1659 laundering recess rosary tractors coulter federer ##ifiers ##plin persistence ##quitable geschichte pendulum quakers ##beam bassett pictorial buffet koln ##sitor drills reciprocal shooters ##57 ##cton ##tees converge pip dmitri donnelly yamamoto aqua azores demographics hypnotic spitfire suspend wryly roderick ##rran sebastien ##asurable mavericks ##fles ##200 himalayan prodigy ##iance transvaal demonstrators handcuffs dodged mcnamara sublime 1726 crazed ##efined ##till ivo pondered reconciled shrill sava ##duk bal cad heresy jaipur goran ##nished 341 lux shelly whitehall ##hre israelis peacekeeping ##wled 1703 demetrius ousted ##arians ##zos beale anwar backstroke raged shrinking cremated ##yck benign towing wadi darmstadt landfill parana soothe colleen sidewalks mayfair tumble hepatitis ferrer superstructure ##gingly ##urse ##wee anthropological translators ##mies closeness hooves ##pw mondays ##roll ##vita landscaping ##urized purification sock thorns thwarted jalan tiberius ##taka saline ##rito confidently khyber sculptors ##ij brahms hammersmith inspectors battista fivb fragmentation hackney ##uls arresting exercising antoinette bedfordshire ##zily dyed ##hema 1656 racetrack variability ##tique 1655 austrians deteriorating madman theorists aix lehman weathered 1731 decreed eruptions 1729 flaw quinlan sorbonne flutes nunez 1711 adored downwards fable rasped 1712 moritz mouthful renegade shivers stunts dysfunction restrain translit 327 pancakes ##avio ##cision ##tray 351 vial ##lden bain ##maid ##oxide chihuahua malacca vimes ##rba ##rnier 1664 donnie plaques ##ually 337 bangs floppy huntsville loretta nikolay ##otte eater handgun ubiquitous ##hett eras zodiac 1634 ##omorphic 1820s ##zog cochran ##bula ##lithic warring ##rada dalai excused blazers mcconnell reeling bot este ##abi geese hoax taxon ##bla guitarists ##icon condemning hunts inversion moffat taekwondo ##lvis 1624 stammered ##rest ##rzy sousa fundraiser marylebone navigable uptown cabbage daniela salman shitty whimper ##kian ##utive programmers protections rm ##rmi ##rued forceful ##enes fuss ##tao ##wash brat oppressive reykjavik spartak ticking ##inkles ##kiewicz adolph horst maui protege straighten cpc landau concourse clements resultant ##ando imaginative joo reactivated ##rem ##ffled ##uising consultative ##guide flop kaitlyn mergers parenting somber ##vron supervise vidhan ##imum courtship exemplified harmonies medallist refining ##rrow ##ка amara ##hum 780 goalscorer sited overshadowed rohan displeasure secretive multiplied osman ##orth engravings padre ##kali ##veda miniatures mis ##yala clap pali rook ##cana 1692 57th antennae astro oskar 1628 bulldog crotch hackett yucatan ##sure amplifiers brno ferrara 
migrating ##gree thanking turing ##eza mccann ting andersson onslaught gaines ganga incense standardization ##mation sentai scuba stuffing turquoise waivers alloys ##vitt regaining vaults ##clops ##gizing digger furry memorabilia probing ##iad payton rec deutschland filippo opaque seamen zenith afrikaans ##filtration disciplined inspirational ##merie banco confuse grafton tod ##dgets championed simi anomaly biplane ##ceptive electrode ##para 1697 cleavage crossbow swirl informant ##lars ##osta afi bonfire spec ##oux lakeside slump ##culus ##lais ##qvist ##rrigan 1016 facades borg inwardly cervical xl pointedly 050 stabilization ##odon chests 1699 hacked ctv orthogonal suzy ##lastic gaulle jacobite rearview ##cam ##erted ashby ##drik ##igate ##mise ##zbek affectionately canine disperse latham ##istles ##ivar spielberg ##orin ##idium ezekiel cid ##sg durga middletown ##cina customized frontiers harden ##etano ##zzy 1604 bolsheviks ##66 coloration yoko ##bedo briefs slabs debra liquidation plumage ##oin blossoms dementia subsidy 1611 proctor relational jerseys parochial ter ##ici esa peshawar cavalier loren cpi idiots shamrock 1646 dutton malabar mustache ##endez ##ocytes referencing terminates marche yarmouth ##sop acton mated seton subtly baptised beige extremes jolted kristina telecast ##actic safeguard waldo ##baldi ##bular endeavors sloppy subterranean ##ensburg ##itung delicately pigment tq ##scu 1626 ##ound collisions coveted herds ##personal ##meister ##nberger chopra ##ricting abnormalities defective galician lucie ##dilly alligator likened ##genase burundi clears complexion derelict deafening diablo fingered champaign dogg enlist isotope labeling mrna ##erre brilliance marvelous ##ayo 1652 crawley ether footed dwellers deserts hamish rubs warlock skimmed ##lizer 870 buick embark heraldic irregularities ##ajan kiara ##kulam ##ieg antigen kowalski ##lge oakley visitation ##mbit vt ##suit 1570 murderers ##miento ##rites chimneys ##sling condemn custer exchequer havre ##ghi fluctuations ##rations dfb hendricks vaccines ##tarian nietzsche biking juicy ##duced brooding scrolling selangor ##ragan 352 annum boomed seminole sugarcane ##dna departmental dismissing innsbruck arteries ashok batavia daze kun overtook ##rga ##tlan beheaded gaddafi holm electronically faulty galilee fractures kobayashi ##lized gunmen magma aramaic mala eastenders inference messengers bf ##qu 407 bathrooms ##vere 1658 flashbacks ideally misunderstood ##jali ##weather mendez ##grounds 505 uncanny ##iii 1709 friendships ##nbc sacrament accommodated reiterated logistical pebbles thumped ##escence administering decrees drafts ##flight ##cased ##tula futuristic picket intimidation winthrop ##fahan interfered 339 afar francoise morally uta cochin croft dwarfs ##bruck ##dents ##nami biker ##hner ##meral nano ##isen ##ometric ##pres ##ан brightened meek parcels securely gunners ##jhl ##zko agile hysteria ##lten ##rcus bukit champs chevy cuckoo leith sadler theologians welded ##section 1663 jj plurality xander ##rooms ##formed shredded temps intimately pau tormented ##lok ##stellar 1618 charred ems essen ##mmel alarms spraying ascot blooms twinkle ##abia ##apes internment obsidian ##chaft snoop ##dav ##ooping malibu ##tension quiver ##itia hays mcintosh travers walsall ##ffie 1623 beverley schwarz plunging structurally m3 rosenthal vikram ##tsk 770 ghz ##onda ##tiv chalmers groningen pew reckon unicef ##rvis 55th ##gni 1651 sulawesi avila cai metaphysical screwing turbulence ##mberg augusto samba 56th baffled momentary 
toxin ##urian ##wani aachen condoms dali steppe ##3d ##app ##oed ##year adolescence dauphin electrically inaccessible microscopy nikita ##ega atv ##cel ##enter ##oles ##oteric ##ы accountants punishments wrongly bribes adventurous clinch flinders southland ##hem ##kata gough ##ciency lads soared ##ה undergoes deformation outlawed rubbish ##arus ##mussen ##nidae ##rzburg arcs ##ingdon ##tituted 1695 wheelbase wheeling bombardier campground zebra ##lices ##oj ##bain lullaby ##ecure donetsk wylie grenada ##arding ##ης squinting eireann opposes ##andra maximal runes ##broken ##cuting ##iface ##ror ##rosis additive britney adultery triggering ##drome detrimental aarhus containment jc swapped vichy ##ioms madly ##oric ##rag brant ##ckey ##trix 1560 1612 broughton rustling ##stems ##uder asbestos mentoring ##nivorous finley leaps ##isan apical pry slits substitutes ##dict intuitive fantasia insistent unreasonable ##igen ##vna domed hannover margot ponder ##zziness impromptu jian lc rampage stemming ##eft andrey gerais whichever amnesia appropriated anzac clicks modifying ultimatum cambrian maids verve yellowstone ##mbs conservatoire ##scribe adherence dinners spectra imperfect mysteriously sidekick tatar tuba ##aks ##ifolia distrust ##athan ##zle c2 ronin zac ##pse celaena instrumentalist scents skopje ##mbling comical compensated vidal condor intersect jingle wavelengths ##urrent mcqueen ##izzly carp weasel 422 kanye militias postdoctoral eugen gunslinger ##ɛ faux hospice ##for appalled derivation dwarves ##elis dilapidated ##folk astoria philology ##lwyn ##otho ##saka inducing philanthropy ##bf ##itative geek markedly sql ##yce bessie indices rn ##flict 495 frowns resolving weightlifting tugs cleric contentious 1653 mania rms ##miya ##reate ##ruck ##tucket bien eels marek ##ayton ##cence discreet unofficially ##ife leaks ##bber 1705 332 dung compressor hillsborough pandit shillings distal ##skin 381 ##tat ##you nosed ##nir mangrove undeveloped ##idia textures ##inho ##500 ##rise ae irritating nay amazingly bancroft apologetic compassionate kata symphonies ##lovic airspace ##lch 930 gifford precautions fulfillment sevilla vulgar martinique ##urities looting piccolo tidy ##dermott quadrant armchair incomes mathematicians stampede nilsson ##inking ##scan foo quarterfinal ##ostal shang shouldered squirrels ##owe 344 vinegar ##bner ##rchy ##systems delaying ##trics ars dwyer rhapsody sponsoring ##gration bipolar cinder starters ##olio ##urst 421 signage ##nty aground figurative mons acquaintances duets erroneously soyuz elliptic recreated ##cultural ##quette ##ssed ##tma ##zcz moderator scares ##itaire ##stones ##udence juniper sighting ##just ##nsen britten calabria ry bop cramer forsyth stillness ##л airmen gathers unfit ##umber ##upt taunting ##rip seeker streamlined ##bution holster schumann tread vox ##gano ##onzo strive dil reforming covent newbury predicting ##orro decorate tre ##puted andover ie asahi dept dunkirk gills ##tori buren huskies ##stis ##stov abstracts bets loosen ##opa 1682 yearning ##glio ##sir berman effortlessly enamel napoli persist ##peration ##uez attache elisa b1 invitations ##kic accelerating reindeer boardwalk clutches nelly polka starbucks ##kei adamant huey lough unbroken adventurer embroidery inspecting stanza ##ducted naia taluka ##pone ##roids chases deprivation florian ##jing ##ppet earthly ##lib ##ssee colossal foreigner vet freaks patrice rosewood triassic upstate ##pkins dominates ata chants ks vo ##400 ##bley ##raya ##rmed 555 agra infiltrate ##ailing 
##ilation ##tzer ##uppe ##werk binoculars enthusiast fujian squeak ##avs abolitionist almeida boredom hampstead marsden rations ##ands inflated 334 bonuses rosalie patna ##rco 329 detachments penitentiary 54th flourishing woolf ##dion ##etched papyrus ##lster ##nsor ##toy bobbed dismounted endelle inhuman motorola tbs wince wreath ##ticus hideout inspections sanjay disgrace infused pudding stalks ##urbed arsenic leases ##hyl ##rrard collarbone ##waite ##wil dowry ##bant ##edance genealogical nitrate salamanca scandals thyroid necessitated ##! ##" ### ##$ ##% ##& ##' ##( ##) ##* ##+ ##, ##- ##. ##/ ##: ##; ##< ##= ##> ##? ##@ ##[ ##\ ##] ##^ ##_ ##` ##{ ##| ##} ##~ ##¡ ##¢ ##£ ##¤ ##¥ ##¦ ##§ ##¨ ##© ##ª ##« ##¬ ##® ##± ##´ ##µ ##¶ ##· ##º ##» ##¼ ##¾ ##¿ ##æ ##ð ##÷ ##þ ##đ ##ħ ##ŋ ##œ ##ƒ ##ɐ ##ɑ ##ɒ ##ɔ ##ɕ ##ə ##ɡ ##ɣ ##ɨ ##ɪ ##ɫ ##ɬ ##ɯ ##ɲ ##ɴ ##ɹ ##ɾ ##ʀ ##ʁ ##ʂ ##ʃ ##ʉ ##ʊ ##ʋ ##ʌ ##ʎ ##ʐ ##ʑ ##ʒ ##ʔ ##ʰ ##ʲ ##ʳ ##ʷ ##ʸ ##ʻ ##ʼ ##ʾ ##ʿ ##ˈ ##ˡ ##ˢ ##ˣ ##ˤ ##β ##γ ##δ ##ε ##ζ ##θ ##κ ##λ ##μ ##ξ ##ο ##π ##ρ ##σ ##τ ##υ ##φ ##χ ##ψ ##ω ##б ##г ##д ##ж ##з ##м ##п ##с ##у ##ф ##х ##ц ##ч ##ш ##щ ##ъ ##э ##ю ##ђ ##є ##і ##ј ##љ ##њ ##ћ ##ӏ ##ա ##բ ##գ ##դ ##ե ##թ ##ի ##լ ##կ ##հ ##մ ##յ ##ն ##ո ##պ ##ս ##վ ##տ ##ր ##ւ ##ք ##־ ##א ##ב ##ג ##ד ##ו ##ז ##ח ##ט ##י ##ך ##כ ##ל ##ם ##מ ##ן ##נ ##ס ##ע ##ף ##פ ##ץ ##צ ##ק ##ר ##ש ##ת ##، ##ء ##ب ##ت ##ث ##ج ##ح ##خ ##ذ ##ز ##س ##ش ##ص ##ض ##ط ##ظ ##ع ##غ ##ـ ##ف ##ق ##ك ##و ##ى ##ٹ ##پ ##چ ##ک ##گ ##ں ##ھ ##ہ ##ے ##अ ##आ ##उ ##ए ##क ##ख ##ग ##च ##ज ##ट ##ड ##ण ##त ##थ ##द ##ध ##न ##प ##ब ##भ ##म ##य ##र ##ल ##व ##श ##ष ##स ##ह ##ा ##ि ##ी ##ो ##। ##॥ ##ং ##অ ##আ ##ই ##উ ##এ ##ও ##ক ##খ ##গ ##চ ##ছ ##জ ##ট ##ড ##ণ ##ত ##থ ##দ ##ধ ##ন ##প ##ব ##ভ ##ম ##য ##র ##ল ##শ ##ষ ##স ##হ ##া ##ি ##ী ##ে ##க ##ச ##ட ##த ##ந ##ன ##ப ##ம ##ய ##ர ##ல ##ள ##வ ##ா ##ி ##ு ##ே ##ை ##ನ ##ರ ##ಾ ##ක ##ය ##ර ##ල ##ව ##ා ##ก ##ง ##ต ##ท ##น ##พ ##ม ##ย ##ร ##ล ##ว ##ส ##อ ##า ##เ ##་ ##། ##ག ##ང ##ད ##ན ##པ ##བ ##མ ##འ ##ར ##ལ ##ས ##မ ##ა ##ბ ##გ ##დ ##ე ##ვ ##თ ##ი ##კ ##ლ ##მ ##ნ ##ო ##რ ##ს ##ტ ##უ ##ᄀ ##ᄂ ##ᄃ ##ᄅ ##ᄆ ##ᄇ ##ᄉ ##ᄊ ##ᄋ ##ᄌ ##ᄎ ##ᄏ ##ᄐ ##ᄑ ##ᄒ ##ᅡ ##ᅢ ##ᅥ ##ᅦ ##ᅧ ##ᅩ ##ᅪ ##ᅭ ##ᅮ ##ᅯ ##ᅲ ##ᅳ ##ᅴ ##ᅵ ##ᆨ ##ᆫ ##ᆯ ##ᆷ ##ᆸ ##ᆼ ##ᴬ ##ᴮ ##ᴰ ##ᴵ ##ᴺ ##ᵀ ##ᵃ ##ᵇ ##ᵈ ##ᵉ ##ᵍ ##ᵏ ##ᵐ ##ᵒ ##ᵖ ##ᵗ ##ᵘ ##ᵣ ##ᵤ ##ᵥ ##ᶜ ##ᶠ ##‐ ##‑ ##‒ ##– ##— ##― ##‖ ##‘ ##’ ##‚ ##“ ##” ##„ ##† ##‡ ##• ##… ##‰ ##′ ##″ ##› ##‿ ##⁄ ##⁰ ##ⁱ ##⁴ ##⁵ ##⁶ ##⁷ ##⁸ ##⁹ ##⁻ ##ⁿ ##₅ ##₆ ##₇ ##₈ ##₉ ##₊ ##₍ ##₎ ##ₐ ##ₑ ##ₒ ##ₓ ##ₕ ##ₖ ##ₗ ##ₘ ##ₚ ##ₛ ##ₜ ##₤ ##₩ ##€ ##₱ ##₹ ##ℓ ##№ ##ℝ ##™ ##⅓ ##⅔ ##← ##↑ ##→ ##↓ ##↔ ##↦ ##⇄ ##⇌ ##⇒ ##∂ ##∅ ##∆ ##∇ ##∈ ##∗ ##∘ ##√ ##∞ ##∧ ##∨ ##∩ ##∪ ##≈ ##≡ ##≤ ##≥ ##⊂ ##⊆ ##⊕ ##⊗ ##⋅ ##─ ##│ ##■ ##▪ ##● ##★ ##☆ ##☉ ##♠ ##♣ ##♥ ##♦ ##♯ ##⟨ ##⟩ ##ⱼ ##⺩ ##⺼ ##⽥ ##、 ##。 ##〈 ##〉 ##《 ##》 ##「 ##」 ##『 ##』 ##〜 ##あ ##い ##う ##え ##お ##か ##き ##く ##け ##こ ##さ ##し ##す ##せ ##そ ##た ##ち ##っ ##つ ##て ##と ##な ##に ##ぬ ##ね ##の ##は ##ひ ##ふ ##へ ##ほ ##ま ##み ##む ##め ##も ##や ##ゆ ##よ ##ら ##り ##る ##れ ##ろ ##を ##ん ##ァ ##ア ##ィ ##イ ##ウ ##ェ ##エ ##オ ##カ ##キ ##ク ##ケ ##コ ##サ ##シ ##ス ##セ ##タ ##チ ##ッ ##ツ ##テ ##ト ##ナ ##ニ ##ノ ##ハ ##ヒ ##フ ##ヘ ##ホ ##マ ##ミ ##ム ##メ ##モ ##ャ ##ュ ##ョ ##ラ ##リ ##ル ##レ ##ロ ##ワ ##ン ##・ ##ー ##一 ##三 ##上 ##下 ##不 ##世 ##中 ##主 ##久 ##之 ##也 ##事 ##二 ##五 ##井 ##京 ##人 ##亻 ##仁 ##介 ##代 ##仮 ##伊 ##会 ##佐 ##侍 ##保 ##信 ##健 ##元 ##光 ##八 ##公 ##内 ##出 ##分 ##前 ##劉 ##力 ##加 ##勝 ##北 ##区 ##十 ##千 ##南 ##博 ##原 ##口 ##古 ##史 ##司 ##合 ##吉 ##同 ##名 ##和 ##囗 ##四 ##国 ##國 ##土 ##地 ##坂 ##城 ##堂 ##場 ##士 ##夏 ##外 ##大 ##天 ##太 ##夫 ##奈 ##女 ##子 ##学 ##宀 ##宇 ##安 ##宗 ##定 ##宣 ##宮 ##家 ##宿 ##寺 ##將 ##小 ##尚 ##山 ##岡 ##島 ##崎 ##川 ##州 ##巿 
##帝 ##平 ##年 ##幸 ##广 ##弘 ##張 ##彳 ##後 ##御 ##德 ##心 ##忄 ##志 ##忠 ##愛 ##成 ##我 ##戦 ##戸 ##手 ##扌 ##政 ##文 ##新 ##方 ##日 ##明 ##星 ##春 ##昭 ##智 ##曲 ##書 ##月 ##有 ##朝 ##木 ##本 ##李 ##村 ##東 ##松 ##林 ##森 ##楊 ##樹 ##橋 ##歌 ##止 ##正 ##武 ##比 ##氏 ##民 ##水 ##氵 ##氷 ##永 ##江 ##沢 ##河 ##治 ##法 ##海 ##清 ##漢 ##瀬 ##火 ##版 ##犬 ##王 ##生 ##田 ##男 ##疒 ##発 ##白 ##的 ##皇 ##目 ##相 ##省 ##真 ##石 ##示 ##社 ##神 ##福 ##禾 ##秀 ##秋 ##空 ##立 ##章 ##竹 ##糹 ##美 ##義 ##耳 ##良 ##艹 ##花 ##英 ##華 ##葉 ##藤 ##行 ##街 ##西 ##見 ##訁 ##語 ##谷 ##貝 ##貴 ##車 ##軍 ##辶 ##道 ##郎 ##郡 ##部 ##都 ##里 ##野 ##金 ##鈴 ##镇 ##長 ##門 ##間 ##阝 ##阿 ##陳 ##陽 ##雄 ##青 ##面 ##風 ##食 ##香 ##馬 ##高 ##龍 ##龸 ##fi ##fl ##! ##( ##) ##, ##- ##. ##/ ##: ##? ##~
================================================ FILE: src/examples/tensorflow/huggingface_bert/huggingface_bert.ipynb ================================================
{ "cells": [ { "cell_type": "markdown", "id": "e91cf83b", "metadata": {}, "source": [ "# Running Huggingface DistilBERT with TensorFlow-Neuron" ] }, { "cell_type": "markdown", "id": "71394e1e", "metadata": {}, "source": [ "In this tutorial you will compile and deploy the DistilBERT version of HuggingFace 🤗 Transformers BERT for Inferentia using TensorFlow-Neuron. The full list of HuggingFace's pretrained BERT models can be found in the BERT section on this page https://huggingface.co/transformers/pretrained_models.html. You can also read about HuggingFace's pipeline feature here: https://huggingface.co/transformers/main_classes/pipelines.html\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge or larger instance. In a real-life scenario, however, compilation should be done on a general-purpose compute instance and only deployment on an inf1 instance, to save costs." ] }, { "cell_type": "markdown", "id": "828ef9bd", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "id": "5becc549", "metadata": {}, "source": [ "To run this tutorial please follow the instructions for [TensorFlow-Neuron Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/tensorflow-neuron.html#setup-tensorflow-neuron) and the [Jupyter Notebook Quickstart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) and set your kernel to \"Python (tensorflow-neuron)\".\n", "\n", "Next, install some additional dependencies." ] }, { "cell_type": "code", "execution_count": null, "id": "ee1a3b84", "metadata": {}, "outputs": [], "source": [ "# Suppress tokenizer warnings, making errors easier to detect\n", "%env TOKENIZERS_PARALLELISM=True\n", "!pip install transformers==4.30.2\n", "!pip install ipywidgets" ] }, { "cell_type": "markdown", "id": "c301cfce", "metadata": {}, "source": [ "## Download From Huggingface and Compile for AWS-Neuron" ] }, { "cell_type": "code", "execution_count": null, "id": "92e8050d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow_neuron as tfn\n", "from transformers import DistilBertTokenizer, TFDistilBertModel\n", "\n", "# Create a wrapper for the DistilBERT model that will accept inputs as a list\n", "# instead of a dictionary.
This will allow the compiled model to be saved\n", "# to disk with the model.save() function.\n", "class DistilBertWrapper(tf.keras.Model):\n", "    def __init__(self, model):\n", "        super().__init__()\n", "        self.model = model\n", "    def __call__(self, example_inputs):\n", "        return self.model({'input_ids' : example_inputs[0], 'attention_mask' : example_inputs[1]})\n", "\n", "\n", "tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')\n", "model = DistilBertWrapper(TFDistilBertModel.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'))\n", "\n", "batch_size = 16\n", "\n", "# create example inputs with a batch size of 16\n", "text = [\"Paris is the capital of France.\"] * batch_size\n", "encoded_input = tokenizer(text, return_tensors='tf', padding='max_length', max_length=64)\n", "\n", "# turn inputs into a list\n", "example_input = [encoded_input['input_ids'], encoded_input['attention_mask']]\n", "\n", "# compile\n", "model_neuron = tfn.trace(model, example_input)\n", "\n", "print(\"Running on neuron:\", model_neuron(example_input))\n", "\n", "# save the model to disk to save recompilation time for next usage\n", "model_neuron.save('./distilbert-neuron-b16')" ] }, { "cell_type": "markdown", "id": "0f2e159a", "metadata": {}, "source": [ "## Run Basic Inference Benchmarking" ] }, { "cell_type": "code", "execution_count": null, "id": "ccf22e74", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "import concurrent.futures\n", "import time\n", "\n", "reloaded_neuron_model = tf.keras.models.load_model('./distilbert-neuron-b16')\n", "print(\"Reloaded model running on neuron:\", reloaded_neuron_model(example_input))\n", "\n", "num_threads = 4\n", "num_inferences = 1000\n", "\n", "latency_list = []\n", "def inference_with_latency_calculation(example_input):\n", "    global latency_list\n", "    start = time.time()\n", "    result = reloaded_neuron_model(example_input)\n", "    end = time.time()\n", "    latency_list.append((end-start) * 1000)\n", "    return result\n", "\n", "start = time.time()\n", "with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:\n", "    futures = []\n", "    for i in range(num_inferences):\n", "        futures.append(executor.submit(inference_with_latency_calculation, example_input))\n", "    for future in concurrent.futures.as_completed(futures):\n", "        get_result = future.result()\n", "end = time.time()\n", "\n", "total_time = end - start\n", "throughput = (num_inferences * batch_size)/total_time\n", "\n", "print(f\"Throughput was {throughput} samples per second.\")\n", "print(f\"Latency p50 was {np.percentile(latency_list, 50)} ms\")\n", "print(f\"Latency p90 was {np.percentile(latency_list, 90)} ms\")\n", "print(f\"Latency p95 was {np.percentile(latency_list, 95)} ms\")\n", "print(f\"Latency p99 was {np.percentile(latency_list, 99)} ms\")\n", "assert throughput >= 1930.0" ] }, { "cell_type": "code", "execution_count": null, "id": "b31b82fc", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 5 }
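Editor's note on input shapes: tfn.trace specializes the compiled graph for the shapes of the example inputs, so the saved model expects the same batch size (16) and padded sequence length (64) it was traced with. The following is a minimal sketch, not part of the original notebook, of reloading the saved model and running it on new text padded to those shapes; the example sentence is illustrative and the model path mirrors the cells above.

import tensorflow as tf
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
neuron_model = tf.keras.models.load_model('./distilbert-neuron-b16')

# Pad new inputs to the shapes used at trace time: batch 16, sequence length 64.
text = ["This movie was excellent."] * 16  # illustrative input, repeated to fill the traced batch
encoded = tokenizer(text, return_tensors='tf', padding='max_length', max_length=64)
outputs = neuron_model([encoded['input_ids'], encoded['attention_mask']])
print(outputs)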
================================================ FILE: src/examples/tensorflow/k8s_bert_demo/Dockerfile.tfserving_example ================================================
FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install -y wget apt-transport-https ca-certificates awscli
RUN echo "deb https://apt.repos.neuron.amazonaws.com xenial main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
RUN apt-get update
RUN apt-get install -y tensorflow-model-server-neuron
================================================ FILE: src/examples/tensorflow/k8s_bert_demo/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)**
================================================ FILE: src/examples/tensorflow/k8s_bert_demo/bert_client.py ================================================
import numpy as np
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import time

if __name__ == '__main__':
    channel = grpc.insecure_channel('localhost:9000')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'bert_mrpc_hc_gelus_b4_l24_0926_02'
    input_array = np.zeros([1, 128], dtype=np.int32)
    request.inputs['input_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(input_array, shape=input_array.shape))
    request.inputs['input_mask'].CopyFrom(tf.contrib.util.make_tensor_proto(input_array, shape=input_array.shape))
    request.inputs['segment_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(input_array, shape=input_array.shape))
    latencies = []
    for i in range(100):
        start = time.time()
        result = stub.Predict(request)
        latencies.append(time.time() - start)
        print("Inference successful: {}".format(i))
    print("Ran {} inferences successfully. Latency average = {}".format(len(latencies), np.average(latencies)))
================================================ FILE: src/examples/tensorflow/k8s_bert_demo/bert_service.yml ================================================
---
kind: Service
apiVersion: v1
metadata:
  name: inf-k8s-test
  labels:
    app: inf-k8s-test
spec:
  ports:
    - name: http-tf-serving
      port: 8500
      targetPort: 8500
    - name: grpc-tf-serving
      port: 9000
      targetPort: 9000
  selector:
    app: inf-k8s-test
    role: master
  type: ClusterIP
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: inf-k8s-test
  labels:
    app: inf-k8s-test
    role: master
spec:
  replicas: 1 # Number of desired replicas. Increase to desired number.
  selector:
    matchLabels:
      app: inf-k8s-test
      role: master
  template:
    metadata:
      labels:
        app: inf-k8s-test
        role: master
    spec:
      volumes:
        - name: sock
          emptyDir: {}
      containers:
        - name: inf-k8s-test
          image: tf-serving-ctr
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-c"]
          # Pull model from s3, then start tensorflow_model_server_neuron with the model.
          args:
            - "aws s3 sync s3:///bert /tmp/bert && \
              tensorflow_model_server_neuron --port=9000 --rest_api_port=8500 --model_name=bert_mrpc_hc_gelus_b4_l24_0926_02 --model_base_path=/tmp/bert/"
          # Open grpc and rest API ports
          ports:
            - containerPort: 8500
            - containerPort: 9000
          # Informs tensorflow_model_server_neuron of UDS socket location
          env:
            - name: NEURON_RTD_ADDRESS
              value: unix:/sock/neuron.sock
          # Arbitrary resource requirements
          resources:
            limits:
              cpu: 4
              memory: 4Gi
            requests:
              cpu: "1"
              memory: 1Gi
          # Shared volume mount, for UDS socket
          volumeMounts:
            - name: sock
              mountPath: /sock
        # Neuron-rtd container
        - name: neuron-rtd
          image: 790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:latest # neuron-rtd image.
          imagePullPolicy: IfNotPresent
          # Neuron-rtd required capabilities
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
                - IPC_LOCK
          # Shared volume mount, for UDS socket
          volumeMounts:
            - name: sock
              mountPath: /sock
          resources:
            limits:
              hugepages-2Mi: 256Mi # configure to 256 * desired number of Inferentia devices.
              aws.amazon.com/neuron: 1 # desired number of Inferentia devices.
            requests:
              memory: 1024Mi # Desired amount of memory. Should be larger than hugepages-2Mi limit.
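Editor's note: the gRPC client above targets port 9000, but the Service also exposes TensorFlow Serving's REST API on port 8500 (the http-tf-serving port). As a hedged sketch, not part of the original demo, the same model could be queried over REST once the service is reachable locally, for example via `kubectl port-forward service/inf-k8s-test 8500:8500`:

import numpy as np
import requests

# Model name taken from bert_service.yml; localhost:8500 assumes a port-forward is active.
MODEL_NAME = 'bert_mrpc_hc_gelus_b4_l24_0926_02'
url = 'http://localhost:8500/v1/models/{}:predict'.format(MODEL_NAME)

# Same zero-filled [1, 128] inputs as the gRPC client, in TF Serving's JSON "inputs" format.
zeros = np.zeros([1, 128], dtype=np.int32).tolist()
payload = {'inputs': {'input_ids': zeros, 'input_mask': zeros, 'segment_ids': zeros}}
response = requests.post(url, json=payload)
print(response.json())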
================================================ FILE: src/examples/tensorflow/keras_resnet50/LICENSE ================================================ Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: src/examples/tensorflow/keras_resnet50/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)**
================================================ FILE: src/examples/tensorflow/keras_resnet50/fp32tofp16.py ================================================
"""
Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import re
import argparse
import tensorflow as tf
import numpy as np
from google.protobuf import text_format
from tensorflow.core.framework import graph_pb2
from tensorflow.core.framework import node_def_pb2
from tensorflow.python.platform import gfile
from tensorflow.core.framework import attr_value_pb2
from tensorflow.python.framework import tensor_util

def ConvertFP32ToOther(graphdef):
    """Converts an FP32 network by casting all constants (weights) to a lower
    precision floating point type (FP16) and updating the dtypes everywhere."""
    cast_type = "float16"
    sess = tf.Session(graph=tf.import_graph_def(graphdef))
    output_graph_def = graph_pb2.GraphDef()
    dummy_tensor = sess.run(tf.constant([0.1]))
    dummy_tensor_proto = tensor_util.make_tensor_proto(dummy_tensor, dtype=cast_type, shape=dummy_tensor.shape)
    dummy_tensor32 = sess.run(tf.constant([0.1]))
    dummy_tensor_proto32 = tensor_util.make_tensor_proto(dummy_tensor32, dtype=tf.float32, shape=dummy_tensor32.shape)
    dt_float_type_attr = attr_value_pb2.AttrValue(type=dummy_tensor_proto32.dtype)
    dt_half_type_attr = attr_value_pb2.AttrValue(type=dummy_tensor_proto.dtype)
    for node in graphdef.node:
        output_node = node_def_pb2.NodeDef()
        output_node.CopyFrom(node)
        if (node.op == "Const"):
            if (node.attr["dtype"] == dt_float_type_attr):
                a = tensor_util.MakeNdarray(node.attr["value"].tensor)
                a = tf.cast(a, cast_type)
                a = sess.run(a)
                output_node.attr["dtype"].CopyFrom(dt_half_type_attr)
                output_node.attr["value"].CopyFrom(
                    attr_value_pb2.AttrValue(
                        tensor=tensor_util.make_tensor_proto(a, dtype=cast_type, shape=a.shape)))
        else:
            if ("T" in node.attr.keys()):
                if (output_node.attr["T"] == dt_float_type_attr):
                    output_node.attr["T"].CopyFrom(dt_half_type_attr)
            if ("Tparams" in node.attr.keys()):
                if (output_node.attr["Tparams"] == dt_float_type_attr):
                    output_node.attr["Tparams"].CopyFrom(dt_half_type_attr)
            if ("dtype" in node.attr.keys()):
                if (node.attr["dtype"] == dt_float_type_attr):
                    output_node.attr["dtype"].CopyFrom(dt_half_type_attr)
            if ("SrcT" in node.attr.keys()):
                if (node.attr["SrcT"] == dt_float_type_attr):
                    output_node.attr["SrcT"].CopyFrom(dt_half_type_attr)
            if ("DstT" in node.attr.keys()):
                if (node.attr["DstT"] == dt_float_type_attr):
                    output_node.attr["DstT"].CopyFrom(dt_half_type_attr)
        output_graph_def.node.extend([output_node])
    return output_graph_def

def load_graph(model_file):
    graph_def = tf.GraphDef()
    with open(model_file, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--graph", help="graph/model to be executed", required=True)
    parser.add_argument("--out_graph", help="graph/model to be generated", required=True)
    args = parser.parse_args()
    graph_f32 = load_graph(args.graph)
    graph_f16 = ConvertFP32ToOther(graph_f32)
    output_xformed_graph_name = args.out_graph
    with gfile.GFile(output_xformed_graph_name, "wb") as f:
        f.write(graph_f16.SerializeToString())
    #with gfile.GFile(output_xformed_graph_name+"txt", 'w') as f:
    #    f.write(text_format.MessageToString(graph_f16))
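Editor's note: a quick sanity check, not part of the original example, can confirm the conversion worked. It loads the GraphDef written by fp32tofp16.py and lists any Const nodes still carrying float32 weights; the file name 'resnet50_fp16_keras.pb' is an assumed --out_graph value used for illustration.

import tensorflow as tf
from tensorflow.core.framework import types_pb2

# Load the converted GraphDef written by fp32tofp16.py.
graph_def = tf.compat.v1.GraphDef()
with open('resnet50_fp16_keras.pb', 'rb') as f:  # assumed --out_graph path
    graph_def.ParseFromString(f.read())

# Any Const node still typed DT_FLOAT was missed by the cast.
leftover_fp32 = [n.name for n in graph_def.node
                 if n.op == 'Const' and n.attr['dtype'].type == types_pb2.DT_FLOAT]
print('Const nodes still in float32:', leftover_fp32 or 'none')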
================================================ FILE: src/examples/tensorflow/keras_resnet50/full_sweep ================================================
#!/usr/bin/env bash
##########################################################################
# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
##########################################################################
echo "" > full_sweep.log
echo "" > full_sweep_results.txt
results=()
for b in $(seq 1 5); do
    for i in 1 2 4 8 12 16; do
        python pb2sm_compile.py --batch_size=$b --neuroncore-pipeline-cores=$i | tee -a full_sweep.log;
        results[$b]+=", "`tail -1 full_sweep.log`
    done
done
head="batch"
for i in 1 2 4 8 12 16; do
    head+=", nc${i}"
done
echo $head | tee -a full_sweep_results.txt
for b in $(seq 1 5); do
    echo $b${results[$b]} | tee -a full_sweep_results.txt
done
================================================ FILE: src/examples/tensorflow/keras_resnet50/gen_resnet50_keras.py ================================================
"""
Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import re
import argparse
import tensorflow as tf
import numpy as np

from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

from google.protobuf import text_format
import tensorflow.python.saved_model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--fp16", action='store_true', help="use float16 parameters and operations")
    args = parser.parse_args()

    # set Keras global configurations
    tf.keras.backend.set_learning_phase(0)
    tf.keras.backend.set_image_data_format('channels_last')

    if (args.fp16):
        float_type = 'float16'
        float_type2 = 'fp16'
    else:
        float_type = 'float32'
        float_type2 = 'fp32'
    tf.keras.backend.set_floatx(float_type)

    # load pre-trained model using Keras
    model_name = 'resnet50_%s_keras' % float_type2
    model = ResNet50(weights='imagenet')

    # various save files
    frozen_file = model_name + '.pb'
    opt_file = model_name + '_opt.pb'

    # obtain parameters
    model_input = model.input.name.replace(':0', '')
    model_output = model.output.name.replace(':0', '')
    batch, height, width, channels = model.input.shape

    print("model, frozen file, optimized file, input size, input node, output node,")
    print("%s, %s, %s, %dx%dx%d, %s, %s" % (model_name, frozen_file, opt_file, width, height, channels, model_input, model_output))

    # obtain the TF session
    sess = tf.compat.v1.keras.backend.get_session()

    # save checkpoint files for freeze_graph
    ckpt_file = '/tmp/' + model_name + '/' + model_name + '.ckpt'
    graph_file = '/tmp/' + model_name + '/' + model_name + '.pb'
    tf.compat.v1.train.Saver().save(sess, ckpt_file)
    tf.io.write_graph(sess.graph.as_graph_def(), logdir='.', name=graph_file, as_text=False)

    print(model_output)
    with tf.compat.v1.Session(graph=tf.Graph()) as sess:
        saver = tf.compat.v1.train.import_meta_graph(ckpt_file + '.meta')
        saver.restore(sess, ckpt_file)
        output_graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, tf.compat.v1.get_default_graph().as_graph_def(), [model_output])
        output_graph_def = tf.compat.v1.graph_util.remove_training_nodes(
            output_graph_def, protected_nodes=[model_output])
        with open(frozen_file, 'wb') as f:
            f.write(output_graph_def.SerializeToString())
================================================ FILE: src/examples/tensorflow/keras_resnet50/infer_resnet50_keras.py ================================================
"""
Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import os
import time
import shutil
import argparse
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import resnet50

parser = argparse.ArgumentParser()
parser.add_argument("--graph", default="resnet50_fp32_keras.pb", help="Graph to use for inference")
parser.add_argument("--input", default="input_1", help="Input of graph")
parser.add_argument("--output", default="probs/Softmax", help="Output of graph")
args = parser.parse_args()

tf.keras.backend.set_image_data_format('channels_last')

def pb_to_saved_model(pb_path, input_names, output_names, model_dir):
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(open(pb_path, 'rb').read())
    with tf.compat.v1.Session(graph=tf.Graph()) as sess:
        tf.import_graph_def(graph_def, name='')
        inputs = {name: sess.graph.get_tensor_by_name(ts_name) for name, ts_name in input_names.items()}
        outputs = {name: sess.graph.get_tensor_by_name(ts_name) for name, ts_name in output_names.items()}
        tf.saved_model.simple_save(sess, model_dir, inputs, outputs)

SAVED_MODEL_DIR = './rn50_fp16'
shutil.rmtree(SAVED_MODEL_DIR, ignore_errors=True)

input_tname = "{}:0".format(args.input)
output_tname = "{}:0".format(args.output)
pb_to_saved_model(args.graph, {input_tname: input_tname}, {output_tname: output_tname}, SAVED_MODEL_DIR)

# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl)
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = resnet50.preprocess_input(np.repeat(img_arr2, 1, axis=0))

# Load model
predictor_host = tf.contrib.predictor.from_saved_model(SAVED_MODEL_DIR)

# Run inference (feed the input tensor name derived from --input)
model_feed_dict = {input_tname: img_arr3}
infa_rslts = predictor_host(model_feed_dict)
print(resnet50.decode_predictions(infa_rslts[output_tname], top=5)[0])
================================================ FILE: src/examples/tensorflow/keras_resnet50/infer_resnet50_keras_loadtest.py ================================================
"""
Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import shutil
import tensorflow as tf
import os
import time
from concurrent import futures
import numpy as np
import statistics
import argparse
import requests
import tensorflow.neuron
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import resnet50
import warnings
import subprocess
import json

tf.keras.backend.set_image_data_format('channels_last')

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('--batch_size', type=int, default=5, choices=range(1, 6), help='Batch size of model as it was compiled')
arg_parser.add_argument('--neuroncore-pipeline-cores', type=int, default=1, choices=range(1, 17), help='Number of NeuronCores limit for each partitioned graph')
args = arg_parser.parse_args()

neuron_ls_output = subprocess.run(["neuron-ls", "-j"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True, encoding="utf-8")
neuron_ls_json = json.loads(neuron_ls_output.stdout)
avail_neuroncores = neuron_ls_json[0]["nc_count"]

USER_BATCH_SIZE = 2 * args.batch_size
NUM_LOOPS_PER_THREAD = 400
COMPILED_MODEL_DIR = "./rn50_fp16_compiled_b" + str(args.batch_size) + "_nc" + str(args.neuroncore_pipeline_cores) + "/1"

# Ensure there's enough buffer capacity to hold in-flight requests in runtime
NUM_INFERS_IN_FLIGHT = args.neuroncore_pipeline_cores + 3
os.environ['NEURON_MAX_NUM_INFERS'] = str(NUM_INFERS_IN_FLIGHT)

num_groups = avail_neuroncores // args.neuroncore_pipeline_cores
group_sizes = [str(args.neuroncore_pipeline_cores)] * num_groups
warnings.warn("NEURONCORE_GROUP_SIZES is being deprecated, if your application is using NEURONCORE_GROUP_SIZES please \
see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/deprecation.html#announcing-end-of-support-for-neuroncore-group-sizes \
for more details.", DeprecationWarning)
os.environ['NEURONCORE_GROUP_SIZES'] = ','.join(group_sizes)

# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl, dtype='float16')
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = np.repeat(img_arr2, USER_BATCH_SIZE, axis=0)

# Load model
NUM_THREADS_PER_PREDICTOR = args.neuroncore_pipeline_cores
pred_list = [tf.contrib.predictor.from_saved_model(COMPILED_MODEL_DIR) for _ in range(num_groups)]
pred_list = pred_list * NUM_THREADS_PER_PREDICTOR
num_threads = len(pred_list)

num_infer_per_thread = []
tot_latency_per_thread = []
thread_active = []
latency_list = []
for i in range(num_threads):
    num_infer_per_thread.append(0)
    tot_latency_per_thread.append(0)
    thread_active.append(0)

def one_thread(pred, model_feed_dict, index):
    global num_infer_per_thread
    thread_active[index] = 1
    for i in range(NUM_LOOPS_PER_THREAD):
        start = time.time()
        result = pred(model_feed_dict)
        delta = time.time() - start
        latency_list.append(delta)
        # skip first warmup run
        if i > 0:
            tot_latency_per_thread[index] += delta
        num_infer_per_thread[index] += USER_BATCH_SIZE
        #print(num_infer_per_thread[index])
    thread_active[index] = 0

def current_throughput():
    global num_infer_per_thread
    global args
    iteration = 0
    num_infer = 0
    last_num_infer = num_infer
    throughput_stats = []
    print("Run with {} NeuronCores".format(avail_neuroncores))
    print("NEURON_MAX_NUM_INFERS (env): " + os.environ.get('NEURON_MAX_NUM_INFERS', ''))
    print("NEURONCORE_GROUP_SIZES (env): " + os.environ.get('NEURONCORE_GROUP_SIZES', ''))
    print("NUM THREADS: ", num_threads)
    print("NUM_LOOPS_PER_THREAD: ", NUM_LOOPS_PER_THREAD)
    print("USER_BATCH_SIZE: ", USER_BATCH_SIZE)
    while num_infer < NUM_LOOPS_PER_THREAD * USER_BATCH_SIZE * num_threads:
        num_infer = 0
        total_thread_cnt = 0
        for i in range(num_threads):
            num_infer = num_infer + num_infer_per_thread[i]
            total_thread_cnt = total_thread_cnt + thread_active[i]
        current_num_infer = num_infer
        throughput = current_num_infer - last_num_infer
        #print('Active threads: {}, current throughput: {} images/sec'.format(total_thread_cnt, throughput))
        # track throughput over time, after warmup
        if iteration > 4 and total_thread_cnt == num_threads:
            throughput_stats.append(throughput)
        last_num_infer = current_num_infer
        iteration += 1
        time.sleep(1.0)
    time.sleep(1.0)
    tot_latency = 0
    for i in range(num_threads):
        tot_latency += tot_latency_per_thread[i]
    # adjust loop count to remove the first warmup run
    print("Throughput values collected:")
    print(throughput_stats)
    print("\nCompiled batch size {:}, user batch size {:}, Throughput stats (images/sec): Avg={:0.0f} Max={:}, Latency stats (msec/user-batch): P50={:0.1f} P90={:0.1f} P95={:0.1f} P99={:0.1f} \n".format(
        args.batch_size, USER_BATCH_SIZE, np.mean(throughput_stats), np.max(throughput_stats),
        (np.percentile(latency_list, 50))*1000.0, (np.percentile(latency_list, 90))*1000.0,
        (np.percentile(latency_list, 95))*1000.0, (np.percentile(latency_list, 99))*1000.0))

print("\n*** Compiled batch size {}, user batch size {}, num NeuronCores {} (input shape: {}, saved model dir: {}) ***\n".format(args.batch_size, USER_BATCH_SIZE, args.neuroncore_pipeline_cores, img_arr3.shape, COMPILED_MODEL_DIR))

# Run inference
model_feed_dict = {'input_1:0': img_arr3}
executor = futures.ThreadPoolExecutor(max_workers=num_threads + 1)
executor.submit(current_throughput)
for i, pred in enumerate(pred_list):
    executor.submit(one_thread, pred, model_feed_dict, i)
================================================ FILE: src/examples/tensorflow/keras_resnet50/keras_resnet50.ipynb ================================================
{ "cells": [ { "cell_type": "markdown", "id": "spectacular-payroll", "metadata": {}, "source": [ "# Tensorflow ResNet 50 Optimization Tutorial" ] }, { "cell_type": "markdown", "id": "equivalent-stack", "metadata": {}, "source": [ "## Note: this tutorial runs on tensorflow-neuron 1.x only" ] }, { "cell_type": "markdown", "id": "alpine-aside", "metadata": {}, "source": [ "## Introduction:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we provide three main sections:\n", "\n", "* Take a ResNet 50 model and perform optimizations on it\n", "\n", "* Compile the model with different batch sizes and NeuronCore Group sizes (read about NeuronCore Group sizes here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-runtime/nrt-theory-of-operation.html#neuron-core-group)\n", "\n", "* Run inference on our multiple compiled models to see which has the best throughput\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [Tensorflow Installation Guide](../../../../frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.html#install-neuron-tensorflow). You can select the Kernel from the “Kernel -> Change Kernel” option on the top of this Jupyter notebook page."
] }, { "cell_type": "markdown", "id": "opened-forty", "metadata": {}, "source": [ "## Install Dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "meaningful-algebra", "metadata": {}, "outputs": [], "source": [ "!pip install pillow requests # Necessary for loading images\n", "!pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/\n", "!pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "markdown", "id": "remarkable-exercise", "metadata": {}, "source": [ "## Compile" ] }, { "cell_type": "markdown", "id": "consecutive-right", "metadata": {}, "source": [ "The following example shows how to compile an FP16 ResNet50 network using various batching parameters to find the optimal solution. On inf1.6xlarge, run through the following steps to get an optimized ResNet 50 model.\n", "First, extract Keras ResNet50 FP32 (resnet50_fp32_keras.pb will be generated):" ] }, { "cell_type": "code", "execution_count": null, "id": "vertical-finland", "metadata": {}, "outputs": [], "source": [ "import re\n", "import argparse\n", "import tensorflow as tf\n", "import numpy as np\n", "\n", "from tensorflow.keras.applications.resnet50 import ResNet50\n", "from tensorflow.keras.preprocessing import image\n", "from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions\n", "\n", "from google.protobuf import text_format\n", "import tensorflow.python.saved_model\n", "\n", "# set Keras global configurations\n", "tf.keras.backend.set_learning_phase(0)\n", "tf.keras.backend.set_image_data_format('channels_last')\n", "\n", "float_type = 'float32'\n", "float_type2 = 'fp32'\n", "tf.keras.backend.set_floatx(float_type)\n", "\n", "# load pre-trained model using Keras\n", "model_name = 'resnet50_%s_keras'%float_type2\n", "model = ResNet50(weights='imagenet')\n", "\n", "# various save files\n", "frozen_file = model_name + '.pb'\n", "opt_file = model_name + '_opt.pb'\n", "\n", "# obtain parameters\n", "model_input = model.input.name.replace(':0', '')\n", "model_output = model.output.name.replace(':0', '')\n", "batch, height, width, channels = model.input.shape\n", "\n", "print (\"model, frozen file, optimized file, input size, input node, output node,\")\n", "print (\"%s, %s, %s, %dx%dx%d, %s, %s\" %(model_name, frozen_file, opt_file, width, height, channels, model_input, model_output) ) \n", "\n", "# obtain the TF session\n", "sess = tf.compat.v1.keras.backend.get_session()\n", "\n", "# save checkpoint files for freeze_graph\n", "ckpt_file = '/tmp/' + model_name + '/' + model_name + '.ckpt'\n", "graph_file = '/tmp/' + model_name + '/' + model_name + '.pb'\n", "tf.compat.v1.train.Saver().save(sess, ckpt_file)\n", "tf.io.write_graph(sess.graph.as_graph_def(), logdir='.', name=graph_file, as_text=False)\n", "\n", "print(model_output)\n", "with tf.compat.v1.Session(graph=tf.Graph()) as sess:\n", " saver = tf.compat.v1.train.import_meta_graph(ckpt_file + '.meta')\n", " saver.restore(sess, ckpt_file)\n", " output_graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(\n", " sess, tf.compat.v1.get_default_graph().as_graph_def(), [model_output])\n", " output_graph_def = tf.compat.v1.graph_util.remove_training_nodes(\n", " output_graph_def, protected_nodes=[model_output])\n", " with open(frozen_file, 'wb') as f:\n", " f.write(output_graph_def.SerializeToString())" ] }, { "cell_type": "markdown", "id": "romance-cyprus", "metadata": {}, "source": [ "Optimize the extracted Keras 
ResNet50 FP32 graph for inference before casting (resnet50_fp32_keras_opt.pb will be generated) with the following transformations to the graph:\n", "\n", "* Remove Identity and CheckNumerics nodes\n", "* Fold FusedBatchNorm constants into previous Conv2D weights\n", "* Fold other constants\n", "* Strip unused nodes\n", "* Sort by execution order" ] }, { "cell_type": "code", "execution_count": null, "id": "higher-grant", "metadata": {}, "outputs": [], "source": [ "import copy\n", "import string\n", "\n", "from google.protobuf import text_format\n", "from tensorflow.core.framework import node_def_pb2\n", "from tensorflow.core.framework import attr_value_pb2\n", "from tensorflow.python.framework import tensor_util\n", "from tensorflow.tools.graph_transforms import TransformGraph\n", "\n", "def clear_input(node):\n", " for i in range(len(node.input)):\n", " node.input.pop()\n", "\n", "def replace_name(node, name):\n", " node.name = name\n", " \n", "def replace_input(node, input_name, new_name):\n", " # node.input.replace(input_name, new_name)\n", " temp = []\n", " for i in node.input:\n", " temp.extend([new_name if i == input_name else i])\n", " clear_input(node)\n", " for i in temp:\n", " node.input.extend([i])\n", "\n", "def swap_names(node1, node2):\n", " temp = node2.name\n", " node2.name = node1.name\n", " node1.name = temp\n", "\n", "def get_const_node(const_node_name, const_by_name):\n", " name = re.sub(\"/read$\", \"\", const_node_name)\n", " return const_by_name[name]\n", "\n", "def get_const_ndarray(const_node_name, const_by_name):\n", " name = re.sub(\"/read$\", \"\", const_node_name)\n", " node = const_by_name[name]\n", " return tf.make_ndarray(node.attr.get(\"value\").tensor)\n", "\n", "def adjust_bias_values(bias_node, fbn_node, const_by_name):\n", " bias_val = get_const_ndarray(bias_node.input[1], const_by_name) \n", " gamma_val = get_const_ndarray(fbn_node.input[1], const_by_name) \n", " mean_val = get_const_ndarray(fbn_node.input[3], const_by_name) \n", " variance_val = get_const_ndarray(fbn_node.input[4], const_by_name) \n", " new_bias = bias_val * gamma_val / np.sqrt(variance_val)\n", " new_tensor = tensor_util.make_tensor_proto(new_bias, new_bias.dtype, new_bias.shape)\n", " bias_const_node = get_const_node(bias_node.input[1], const_by_name)\n", " bias_const_node.attr[\"value\"].CopyFrom(attr_value_pb2.AttrValue(tensor=new_tensor))\n", "\n", "def MoveBiasAddAfterFusedBatchNorm(graphdef):\n", " \"\"\"fold_batch_norm function of TransformGraph is unable to fold Keras ResNet50\n", " because of BiasAdd between Conv2D and FusedBatchNorm (BiasAdd is not needed\n", " if FusedBatchNorm is used, but it exists in Keras ResNet50). 
Here, we \n", " move BiasAdd to after FusedBatchNorm, and adjust bias value by gamma/sqrt(variance).\n", " \"\"\"\n", " sess = tf.compat.v1.Session(graph=tf.import_graph_def(graphdef))\n", " output_graph_def = tf.compat.v1.GraphDef()\n", " node_by_name = {}\n", " const_by_name = {}\n", " for node in graphdef.node:\n", " # Hack: use FusedBatchNormV2 so fold_batch_norm can recognize\n", " if node.op == \"FusedBatchNormV3\":\n", " node.op = \"FusedBatchNorm\"\n", " del(node.attr[\"U\"])\n", " #import pdb; pdb.set_trace()\n", " copied_node = node_def_pb2.NodeDef()\n", " copied_node.CopyFrom(node)\n", " node_by_name[node.name] = copied_node\n", " skip_add_node = False\n", " # Switch Mul/BiasAdd in Keras RN50 so fold_batch_norm transform would work\n", " if node.op == \"Const\":\n", " const_by_name[node.name] = copied_node \n", " elif node.op.startswith(\"FusedBatchNorm\"):\n", " inputs = node.input\n", " for i in inputs:\n", " input_node = node_by_name[i]\n", " if input_node.op == \"BiasAdd\":\n", " output_graph_def.node.remove(input_node)\n", " input_node_input0 = input_node.input[0]\n", " # Adjust bias values (multiply by scale/sqrt(variance))\n", " adjust_bias_values(input_node, node, const_by_name)\n", " # Hack: swap names to avoid changing input of activation\n", " swap_names(copied_node, input_node)\n", " # Fix inputs for these two ops\n", " replace_input(copied_node, i, input_node_input0)\n", " replace_input(input_node, input_node_input0, copied_node.name)\n", " # Fix order in node list\n", " output_graph_def.node.extend([copied_node])\n", " output_graph_def.node.extend([input_node])\n", " skip_add_node = True\n", " # Add maybe-modified nodes if not already done\n", " if not skip_add_node:\n", " output_graph_def.node.extend([copied_node])\n", " return output_graph_def\n", "\n", "def FoldFusedBatchNorm(graph_def):\n", " \"\"\"Optimize training graph for inference:\n", " - Remove Identity and CheckNumerics nodes\n", " - Fold FusedBatchNorm constants into previous Conv2D weights\n", " - Fold other constants\n", " - Strip unused nodes\n", " - Sort by execution order\n", " \"\"\"\n", " transformed_graph_def = TransformGraph (\n", " graph_def,\n", " ['input_1'],\n", " ['probs/Softmax'],\n", " [\n", " 'add_default_attributes',\n", " 'remove_nodes(op=Identity, op=CheckNumerics)',\n", " 'fold_constants(ignore_errors=true)',\n", " 'fold_batch_norms',\n", " 'fold_old_batch_norms',\n", " 'strip_unused_nodes',\n", " 'sort_by_execution_order',\n", " ])\n", " return transformed_graph_def\n", "\n", "def load_graph(model_file):\n", " graph_def = tf.compat.v1.GraphDef()\n", "\n", " with open(model_file, \"rb\") as f:\n", " graph_def.ParseFromString(f.read())\n", " return graph_def\n", "\n", "\n", "graph_orig = load_graph('resnet50_fp32_keras.pb')\n", "graph_mod = MoveBiasAddAfterFusedBatchNorm(graph_orig)\n", "graph_mod2 = FoldFusedBatchNorm(graph_mod)\n", "with tf.io.gfile.GFile('resnet50_fp32_keras_opt.pb', \"wb\") as f:\n", " f.write(graph_mod2.SerializeToString())" ] }, { "cell_type": "markdown", "id": "corresponding-acquisition", "metadata": {}, "source": [ "Convert full graph to FP16 (resnet50_fp16_keras_opt.pb will be generated.\n", "This will take about a minute." 
] }, { "cell_type": "code", "execution_count": null, "id": "detected-training", "metadata": {}, "outputs": [], "source": [ "from tensorflow.core.framework import graph_pb2\n", "from tensorflow.python.platform import gfile\n", "\n", "def ConvertFP32ToOther(graphdef):\n", " \"\"\"Converts an FP32 network by casting all constants (weights) to a lower\n", " precision floating point type (FP16) and updating the dtypes\n", " everywhere.\"\"\"\n", " cast_type = \"float16\"\n", " sess = tf.Session(graph=tf.import_graph_def(graphdef))\n", " output_graph_def = graph_pb2.GraphDef()\n", " dummy_tensor = sess.run(tf.constant([0.1]))\n", " dummy_tensor_proto = tensor_util.make_tensor_proto(dummy_tensor, \\\n", " dtype=cast_type, shape=dummy_tensor.shape)\n", " dummy_tensor32 = sess.run(tf.constant([0.1]))\n", " dummy_tensor_proto32 = tensor_util.make_tensor_proto(dummy_tensor, \\\n", " dtype=tf.float32, shape=dummy_tensor.shape)\n", " dt_float_type_attr = attr_value_pb2.AttrValue(type=dummy_tensor_proto32.dtype)\n", " dt_half_type_attr = attr_value_pb2.AttrValue(type=dummy_tensor_proto.dtype)\n", " for node in graphdef.node:\n", " output_node = node_def_pb2.NodeDef()\n", " output_node.CopyFrom(node)\n", " if (node.op == \"Const\"):\n", " if (node.attr[\"dtype\"] == dt_float_type_attr):\n", " a = tensor_util.MakeNdarray(node.attr[\"value\"].tensor)\n", " a = tf.cast(a, cast_type)\n", " a = sess.run(a)\n", " output_node.attr[\"dtype\"].CopyFrom(dt_half_type_attr)\n", " output_node.attr[\"value\"].CopyFrom(\n", " attr_value_pb2.AttrValue(\n", " tensor=tensor_util.make_tensor_proto(a,\\\n", " dtype=cast_type, shape=a.shape)))\n", " else:\n", " if (\"T\" in node.attr.keys()):\n", " if (output_node.attr[\"T\"] == dt_float_type_attr):\n", " output_node.attr[\"T\"].CopyFrom(dt_half_type_attr)\n", " if (\"Tparams\" in node.attr.keys()):\n", " if (output_node.attr[\"Tparams\"] == dt_float_type_attr):\n", " output_node.attr[\"Tparams\"].CopyFrom(dt_half_type_attr)\n", " if (\"dtype\" in node.attr.keys()):\n", " if (node.attr[\"dtype\"] == dt_float_type_attr):\n", " output_node.attr[\"dtype\"].CopyFrom(dt_half_type_attr)\n", " if (\"SrcT\" in node.attr.keys()):\n", " if (node.attr[\"SrcT\"] == dt_float_type_attr):\n", " output_node.attr[\"SrcT\"].CopyFrom(dt_half_type_attr)\n", " if (\"DstT\" in node.attr.keys()):\n", " if (node.attr[\"DstT\"] == dt_float_type_attr):\n", " output_node.attr[\"DstT\"].CopyFrom(dt_half_type_attr)\n", " output_graph_def.node.extend([output_node])\n", " return output_graph_def\n", "\n", "def load_graph(model_file):\n", " graph_def = tf.GraphDef()\n", "\n", " with open(model_file, \"rb\") as f:\n", " graph_def.ParseFromString(f.read())\n", "\n", " return graph_def\n", "\n", "graph_f32 = load_graph('resnet50_fp32_keras_opt.pb')\n", "graph_f16 = ConvertFP32ToOther(graph_f32)\n", "output_xformed_graph_name = 'resnet50_fp16_keras_opt.pb'\n", "with gfile.GFile(output_xformed_graph_name, \"wb\") as f:\n", " f.write(graph_f16.SerializeToString())\n" ] }, { "cell_type": "markdown", "id": "correct-travel", "metadata": {}, "source": [ "Run the compilation script to sweep through various batch sizes up to 5 and several NeuronCore Group sizes up to 16. The script calls the compilation script pb2sm_compile.py which tries to perform compilation. Some error messages are expected due to known issues (see Known Issues section in the tutorial). If you run all the configurations it will take about 45 minutes." 
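, "\n", "Before kicking off the sweep, you can optionally confirm that the FP16 cast above actually rewrote the weights. A minimal check (assuming the previous cell wrote resnet50_fp16_keras_opt.pb):\n", "\n", "```python\n", "import tensorflow as tf\n", "\n", "gd = tf.GraphDef()\n", "with open('resnet50_fp16_keras_opt.pb', 'rb') as f:\n", "    gd.ParseFromString(f.read())\n", "fp16 = tf.float16.as_datatype_enum\n", "n_const = sum(1 for n in gd.node if n.op == 'Const')\n", "n_fp16 = sum(1 for n in gd.node if n.op == 'Const' and n.attr['dtype'].type == fp16)\n", "print('{} of {} Const nodes are float16'.format(n_fp16, n_const))\n", "```\n"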
] }, { "cell_type": "code", "execution_count": null, "id": "shared-ratio", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "#!/usr/bin/env bash\n", "\n", "echo \"\" > full_sweep.log\n", "echo \"\" > full_sweep_results.txt\n", "\n", "results=()\n", "for b in $(seq 1 5); do \n", " for i in 1 2 4 8 12 16; do \n", " python pb2sm_compile.py --batch_size=$b --neuroncore-pipeline-cores=$i | tee -a full_sweep.log;\n", " results[$b]+=\", \"`tail -1 full_sweep.log`\n", " done\n", "done\n", "\n", "head=\"batch\"\n", "for i in 1 2 4 8 12 16; do\n", " head+=\", nc${i}\"\n", "done \n", "echo $head | tee -a full_sweep_results.txt\n", "for b in $(seq 1 5); do \n", " echo $b${results[$b]} | tee -a full_sweep_results.txt\n", "done" ] }, { "cell_type": "markdown", "id": "attached-austin", "metadata": {}, "source": [ "You should see some output like this:\n", "```\n", "INFO: Compilation finished in 95 seconds with 99.5% operations placed on Inferentia\n", "\n", "1\n", "\n", "*** Batch size 1, num NeuronCores 2 (input shape: (1, 224, 224, 3), saved model dir: rn50_fp16_compiled_b1_nc2) ***\n", "\n", "INFO: Compilation finished in 95 seconds with 99.5% operations placed on Inferentia\n", "\n", "1\n", "\n", "*** Batch size 1, num NeuronCores 4 (input shape: (1, 224, 224, 3), saved model dir: rn50_fp16_compiled_b1_nc4) ***\n", "\n", "INFO: Compilation finished in 95 seconds with 99.5% operations placed on Inferentia\n", "\n", "1\n", "\n", "... (outputs removed)\n", "\n", "*** Batch size 5, num NeuronCores 16 (input shape: (5, 224, 224, 3), saved model dir: rn50_fp16_compiled_b5_nc16) ***\n", "\n", "ERROR: Compilation finished in 120 seconds with less than 50% operations placed on Inferentia (0.0%)\n", "\n", "INFO: Retry compilation without static weights\n", "\n", "ERROR: Retry compilation finished in 137 seconds with less than 50% operations placed on Inferentia (0.0%)\n", "\n", "0\n", "```\n", "\n", "The file full_sweep_results.txt shows a summary of the sweep results with the Neuron 1/27/20 release (0 means compilation unsuccessful and 0 ops mapped to Inferentia, 1 means most ops mapped to Inferentia and non-static weights, 2 means most ops mapped to Inferentia and using static weights):\n", "\n", "```\n", "batch, nc1, nc2, nc4, nc8, nc12, nc16\n", "1, 1, 1, 1, 2, 2, 2\n", "2, 1, 1, 0, 1, 2, 2\n", "3, 1, 1, 1, 1, 1, 1\n", "4, 1, 1, 0, 1, 1, 1\n", "5, 1, 1, 0, 0, 0, 0\n", "```\n" ] }, { "cell_type": "markdown", "id": "surprised-abortion", "metadata": {}, "source": [ "## Inference" ] }, { "cell_type": "markdown", "id": "departmental-surprise", "metadata": {}, "source": [ "Run inference over different batch sizes and NeuronCore groups to obtain throughput and latency results for ResNet50. To apply dynamic batching, the user batch size is set to 2x the compiled batch size (see infer_resnet50_keras_loadtest.py), in order to keep the input queue full and to amortize framework-to-Neuron overhead.\n", "\n", "Note: The results are based on the Neuron v1.12.2 (Mar 4th 2021) release. 
These will continue to improve as we increase Neuron performance.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "requested-inspiration", "metadata": {}, "outputs": [], "source": [ "!cd ~/aws-neuron-sdk/src/examples/tensorflow/keras_resnet50/\n", "!echo \"\" > batch.log\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=1 | tee -a batch.log; done\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=2 | tee -a batch.log; done\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=4 | tee -a batch.log; done\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=8 | tee -a batch.log; done\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=12 | tee -a batch.log; done\n", "!for i in $(seq 1 5); do python infer_resnet50_keras_loadtest.py --batch_size=$i --neuroncore-pipeline-cores=16 | tee -a batch.log; done" ] }, { "cell_type": "markdown", "id": "split-genesis", "metadata": {}, "source": [ "The file batch.log now contains the results for each batch size. We can look at the throughput values to get an idea of which models are performing well. The output should look something like this:\n", "\n", "The best model configuration for throughput (if you run on an inf1.6xlarge as suggested in the tutorial) is batch size 5 with a NeuronCore group size of 2. Increasing batch size usually helps to increase throughput (up to a certain extent)." ] }, { "cell_type": "markdown", "id": "filled-township", "metadata": {}, "source": [ "```\n", "*** Compiled batch size 5, user batch size 10, num NeuronCores 2 (input shape: (10, 224, 224, 3), saved model dir: ./rn50_fp16_compiled_b5_nc2/1) ***\n", "\n", "Instance type inf1.6xlarge with 16 NeuronCores\n", "NEURON_MAX_NUM_INFERS (env): 5\n", "NEURONCORE_GROUP_SIZES (env): 2,2,2,2,2,2,2,2\n", "NUM THREADS: 16\n", "NUM_LOOPS_PER_THREAD: 400\n", "USER_BATCH_SIZE: 10\n", "Throughput values collected:\n", "[10680, 10700, 10660]\n", "\n", "(rest of outputs removed)\n", "```" ] }, { "cell_type": "markdown", "id": "189c4f0e-1a4e-4067-921f-95449c45dedd", "metadata": {}, "source": [ "## Known Issues\n", "\n", "### Unable to compile with batch and num NeuronCores combination\n", "\n", "For some combinations of batch size and number of NeuronCores, you may\n", "see an internal compiler error as below. Please see the sweep results\n", "above for the Neuron 1/27/20 release. Furthermore, auto-casting to\n", "bfloat16 from an FP32 network with a batch size larger than 1 results\n", "in the same error.\n",
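"\n", "When a configuration fails this way, the sweep script pb2sm_compile.py falls back to recompiling without static weights. The pattern it uses looks roughly like this (a sketch; argument plumbing simplified):\n", "\n", "```python\n", "import tensorflow.neuron as tfn\n", "\n", "rslts = tfn.saved_model.compile(saved_model_dir, compiled_dir,\n", "                                model_feed_dict={'input_1:0': img_arr},\n", "                                dynamic_batch_size=True, compiler_args=compiler_args)\n", "if rslts['OnNeuronRatio'] * 100 < 50 and '--static-weights' in compiler_args:\n", "    compiler_args.remove('--static-weights')  # retry without static weights\n", "    rslts = tfn.saved_model.compile(saved_model_dir, compiled_dir,\n", "                                    model_feed_dict={'input_1:0': img_arr},\n", "                                    dynamic_batch_size=True, compiler_args=compiler_args)\n", "```\n",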
"\n", "\n", "```bash\n", "\n", "INFO:tensorflow:fusing subgraph neuron_op_a73aed4b95ca5d5b with neuron-cc; log file is at /home/ubuntu/keras_fp16_benchmarking_db/compiler_workdir/neuron_op_a73aed4b95ca5d5b/graph_def.neuron-cc.log\n", " WARNING:tensorflow:Failed to fuse subgraph neuron_op_a73aed4b95ca5d5b with '/home/ubuntu/test_venv/bin/neuron-cc compile /home/ubuntu/keras_fp16_benchmarking_db/compiler_workdir/neuron_op_a73aed4b95ca5d5b/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /home/ubuntu/keras_fp16_benchmarking_db/compiler_workdir/neuron_op_a73aed4b95ca5d5b/graph_def.neff --io-config \"{\\\"inputs\\\": {\\\"input_10/_0:0\\\": [[6, 224, 224, 3], \\\"float16\\\"]}, \\\"outputs\\\": [\\\"probs/Softmax:0\\\"]}\" --batching_en --rematerialization_en --sb_size 120 --spill_dis --enable-replication True'\n", " WARNING:tensorflow:neuron-cc error message:\n", " WARNING:tensorflow:01/23/2020 01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: ***************************************************************\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: An Internal Compiler Error has occurred\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: ***************************************************************\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Please contact Customer Support and provide the following details.\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Error message: Non-zero exit status (134) for command: /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 120 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --spill_dis --true_dep --mm_order --batching_en --rematerialization_en\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Error class: CompilerInternalError\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Error location: job.Scheduler.3\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Command line: /home/ubuntu/test_venv/bin/neuron-cc compile /home/ubuntu/keras_fp16_benchmarking_db/compiler_workdir/neuron_op_a73aed4b95ca5d5b/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /home/ubuntu/keras_fp16_benchmarking_db/compiler_workdir/neuron_op_a73aed4b95ca5d5b/graph_def.neff --io-config '{\"inputs\": {\"input_10/_0:0\": [[6, 224, 224, 3], \"float16\"]}, \"outputs\": [\"probs/Softmax:0\"]}' --batching_en --rematerialization_en --sb_size 120 --spill_dis --enable-replication True\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Internal details:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: File \"neuroncc/driver/Job.py\", line 207, in neuroncc.driver.Job.runSingleInputFn\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: File \"neuroncc/driver/jobs/Scheduler.py\", line 58, in neuroncc.driver.jobs.Scheduler.Scheduler.runSingleInput\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: File \"neuroncc/driver/Job.py\", line 145, in neuroncc.driver.Job.Job.shellCommand\n", " 01/23/2020 
01:15:40 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:40 AM ERROR [neuron-cc]: Version information:\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: Neuron Compiler version 1.0.6632.0+6001610955\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]:\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: HWM version 1.0.839.0-6001300654\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: NEFF version 0.6\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: TVM version 1.0.1589.0+6001610955\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: NumPy version 1.16.5\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: MXNet not available\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]: TF version 1.15.0\n", " 01/23/2020 01:15:41 AM ERROR [neuron-cc]:\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "gentle-census", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/tensorflow/keras_resnet50/optimize_for_inference.py ================================================ """ Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0 """ import re import copy import argparse import tensorflow as tf import numpy as np import string from google.protobuf import text_format from tensorflow.core.framework import node_def_pb2 from tensorflow.core.framework import attr_value_pb2 from tensorflow.python.framework import tensor_util from tensorflow.tools.graph_transforms import TransformGraph def clear_input(node): for i in range(len(node.input)): node.input.pop() def replace_name(node, name): node.name = name def replace_input(node, input_name, new_name): # node.input.replace(input_name, new_name) temp = [] for i in node.input: temp.extend([new_name if i == input_name else i]) clear_input(node) for i in temp: node.input.extend([i]) def swap_names(node1, node2): temp = node2.name node2.name = node1.name node1.name = temp def get_const_node(const_node_name, const_by_name): name = re.sub("/read$", "", const_node_name) return const_by_name[name] def get_const_ndarray(const_node_name, const_by_name): name = re.sub("/read$", "", const_node_name) node = const_by_name[name] return tf.make_ndarray(node.attr.get("value").tensor) def adjust_bias_values(bias_node, fbn_node, const_by_name): bias_val = get_const_ndarray(bias_node.input[1], const_by_name) gamma_val = get_const_ndarray(fbn_node.input[1], const_by_name) mean_val = get_const_ndarray(fbn_node.input[3], const_by_name) variance_val = get_const_ndarray(fbn_node.input[4], const_by_name) new_bias = bias_val * gamma_val / np.sqrt(variance_val) new_tensor = tensor_util.make_tensor_proto(new_bias, new_bias.dtype, new_bias.shape) bias_const_node = get_const_node(bias_node.input[1], const_by_name) bias_const_node.attr["value"].CopyFrom(attr_value_pb2.AttrValue(tensor=new_tensor)) def MoveBiasAddAfterFusedBatchNorm(graphdef): """fold_batch_norm function of TransformGraph is unable to fold Keras ResNet50 because of BiasAdd between Conv2D and FusedBatchNorm (BiasAdd is not needed if 
FusedBatchNorm is used, but it exists in Keras ResNet50). Here, we move BiasAdd to after FusedBatchNorm, and adjust bias value by gamma/sqrt(variance). """ sess = tf.compat.v1.Session(graph=tf.import_graph_def(graphdef)) output_graph_def = tf.compat.v1.GraphDef() node_by_name = {} const_by_name = {} for node in graphdef.node: # Hack: use FusedBatchNormV2 so fold_batch_norm can recognize if node.op == "FusedBatchNormV3": node.op = "FusedBatchNorm" del(node.attr["U"]) #import pdb; pdb.set_trace() copied_node = node_def_pb2.NodeDef() copied_node.CopyFrom(node) node_by_name[node.name] = copied_node skip_add_node = False # Switch Mul/BiasAdd in Keras RN50 so fold_batch_norm transform would work if node.op == "Const": const_by_name[node.name] = copied_node elif node.op.startswith("FusedBatchNorm"): inputs = node.input for i in inputs: input_node = node_by_name[i] if input_node.op == "BiasAdd": output_graph_def.node.remove(input_node) input_node_input0 = input_node.input[0] # Adjust bias values (multiply by scale/sqrt(variance)) adjust_bias_values(input_node, node, const_by_name) # Hack: swap names to avoid changing input of activation swap_names(copied_node, input_node) # Fix inputs for these two ops replace_input(copied_node, i, input_node_input0) replace_input(input_node, input_node_input0, copied_node.name) # Fix order in node list output_graph_def.node.extend([copied_node]) output_graph_def.node.extend([input_node]) skip_add_node = True # Add maybe-modified nodes if not already done if not skip_add_node: output_graph_def.node.extend([copied_node]) return output_graph_def def FoldFusedBatchNorm(graph_def): """Optimize training graph for inference: - Remove Identity and CheckNumerics nodes - Fold FusedBatchNorm constants into previous Conv2D weights - Fold other constants - Strip unused nodes - Sort by execution order """ transformed_graph_def = TransformGraph ( graph_def, ['input_1'], ['probs/Softmax'], [ 'add_default_attributes', 'remove_nodes(op=Identity, op=CheckNumerics)', 'fold_constants(ignore_errors=true)', 'fold_batch_norms', 'fold_old_batch_norms', 'strip_unused_nodes', 'sort_by_execution_order', ]) return transformed_graph_def def load_graph(model_file): graph_def = tf.compat.v1.GraphDef() with open(model_file, "rb") as f: graph_def.ParseFromString(f.read()) return graph_def if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--graph", help="graph/model to be executed", required=True) parser.add_argument("--out_graph", help="graph/model to be generated", required=True) args = parser.parse_args() graph_orig = load_graph(args.graph) graph_mod = MoveBiasAddAfterFusedBatchNorm(graph_orig) graph_mod2 = FoldFusedBatchNorm(graph_mod) with tf.io.gfile.GFile(args.out_graph, "wb") as f: f.write(graph_mod2.SerializeToString()) #with tf.io.gfile.GFile(args.out_graph + "txt", 'w') as f: # f.write(text_format.MessageToString(graph_mod2)) ================================================ FILE: src/examples/tensorflow/keras_resnet50/pb2sm_compile.py ================================================ """ Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
SPDX-License-Identifier: MIT-0 """ import time import shutil import numpy as np import argparse import tensorflow as tf from tensorflow.keras.preprocessing import image from tensorflow.keras.applications import resnet50 import tensorflow.neuron as tfn tf.keras.backend.set_image_data_format('channels_last') arg_parser = argparse.ArgumentParser() arg_parser.add_argument('--batch_size', type=int, default=5, choices=range(1, 6), help='Input data batch size for compilation of model') arg_parser.add_argument('--neuroncore-pipeline-cores', type=int, default=1, choices=range(1, 17), help='Number of NeuronCores limit for each partitioned graph') arg_parser.add_argument('--debug_args', type=str, default="", help='Optional Compiler debug args') arg_parser.add_argument('--workdir', type=str, default="compiler_workdir", help='Compiler work directory') args = arg_parser.parse_args() def pb_to_saved_model(pb_path, input_names, output_names, model_dir): graph_def = tf.GraphDef() graph_def.ParseFromString(open(pb_path, 'rb').read()) with tf.Session(graph=tf.Graph()) as sess: tf.import_graph_def(graph_def, name='') inputs = {name: sess.graph.get_tensor_by_name(ts_name) for name, ts_name in input_names.items()} outputs = {name: sess.graph.get_tensor_by_name(ts_name) for name, ts_name in output_names.items()} tf.saved_model.simple_save(sess, model_dir, inputs, outputs) saved_model_dir = "rn50_fp16" shutil.rmtree(saved_model_dir, ignore_errors=True) pb_to_saved_model("resnet50_fp16_keras_opt.pb", {"input_1:0": "input_1:0"}, {"probs/Softmax:0" : "probs/Softmax:0"}, saved_model_dir) batch_size = args.batch_size img_arr = np.zeros([batch_size, 224, 224, 3], dtype='float16') compiled_saved_model_dir = saved_model_dir + "_compiled_b" + str(batch_size) + "_nc" + str(args.neuroncore_pipeline_cores) shutil.rmtree(compiled_saved_model_dir + "/1", ignore_errors=True) print("\n*** Batch size {}, num NeuronCores {} (input shape: {}, saved model dir: {}) ***\n".format(batch_size, args.neuroncore_pipeline_cores, img_arr.shape, compiled_saved_model_dir)) compiler_args = ['--neuroncore-pipeline-cores', str(args.neuroncore_pipeline_cores)] if args.debug_args: compiler_args.extend(args.debug_args.split(" ")) static_weights = False if args.neuroncore_pipeline_cores >= 8: static_weights = True shutil.rmtree(args.workdir, ignore_errors=True) start = time.time() rslts = tfn.saved_model.compile(saved_model_dir, compiled_saved_model_dir + "/1", model_feed_dict={'input_1:0' : img_arr}, compiler_workdir=args.workdir, dynamic_batch_size=True, compiler_args = compiler_args) delta = time.time() - start perc_on_inf = rslts['OnNeuronRatio'] * 100 compile_success = False if perc_on_inf < 50: print("\nERROR: Compilation finished in {:.0f} seconds with less than 50% operations placed on Inferentia ({:.1f}%)\n".format(delta, perc_on_inf)) if '--static-weights' in compiler_args: print("INFO: Retry compilation without static weights") compiler_args.remove('--static-weights') static_weights = False shutil.rmtree(compiled_saved_model_dir + "/1", ignore_errors=True) shutil.rmtree('compiler_workdir2', ignore_errors=True) start = time.time() rslts = tfn.saved_model.compile(saved_model_dir, compiled_saved_model_dir + "/1", model_feed_dict={'input_1:0' : img_arr}, compiler_workdir='compiler_workdir2', dynamic_batch_size=True, compiler_args = compiler_args) delta = time.time() - start perc_on_inf = rslts['OnNeuronRatio'] * 100 if perc_on_inf < 50: print("\nERROR: Retry compilation finished in {:.0f} seconds with less than 50% operations placed on 
Inferentia ({:.1f}%)\n".format(delta, perc_on_inf)) else: print("\nINFO: Retry compilation finished in {:.0f} seconds with {:.1f}% operations placed on Inferentia\n".format(delta, perc_on_inf)) compile_success = True else: print("\nINFO: Compilation finished in {:.0f} seconds with {:.1f}% operations placed on Inferentia\n".format(delta, perc_on_inf)) compile_success = True # Prepare SavedModel for uploading to Inf1 instance completion_code = 0 if compile_success: shutil.make_archive('./' + compiled_saved_model_dir, 'zip', './', compiled_saved_model_dir) completion_code = 1 + int(static_weights) print(completion_code) exit(int(not compile_success)) ================================================ FILE: src/examples/tensorflow/keras_resnet50/run_all ================================================ #!/usr/bin/env bash ########################################################################## # Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. # SPDX-License-Identifier: MIT-0 ########################################################################## pip install pillow # Extract Keras ResNet50 FP32 and check inference python gen_resnet50_keras.py python infer_resnet50_keras.py --graph resnet50_fp32_keras.pb # Optimize fp32 graph for inference before casting python optimize_for_inference.py --graph resnet50_fp32_keras.pb --out_graph resnet50_fp32_keras_opt.pb python infer_resnet50_keras.py --graph resnet50_fp32_keras_opt.pb # Cast full graph to FP16 python fp32tofp16.py --graph resnet50_fp32_keras_opt.pb --out_graph resnet50_fp16_keras_opt.pb python infer_resnet50_keras.py --graph resnet50_fp16_keras_opt.pb # Compile python pb2sm_compile.py # Infer python infer_resnet50_keras_loadtest.py ================================================ FILE: src/examples/tensorflow/openpose_demo/openpose.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "caff04ba", "metadata": {}, "source": [ "# Running OpenPose on Inferentia\n" ] }, { "cell_type": "markdown", "id": "09b2919a", "metadata": {}, "source": [ "## Note: this tutorial runs on tensorflow-neuron 1.x only" ] }, { "cell_type": "markdown", "id": "4dcf9bb1", "metadata": {}, "source": [ "## Introduction:\n", "\n", "In this tutorial we will compile and deploy an OpenPose model on Inferentia. This Jupyter notebook should run on an inf1.6xlarge instance for compilation and inference. It is the inference part of this tutorial that requires an inf1.6xlarge, not the compilation itself. For simplicity we will run this tutorial on a single instance, but in a real-life scenario the compilation can be done on a c5.4xlarge compute instance and the deployment on the inf1 instance family.\n", "\n", "In this tutorial we provide two main sections:\n", "1. Compile the OpenPose model on inf1.6xlarge.\n", "2. Infer the same compiled model on inf1.6xlarge.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [Tensorflow Installation Guide](../../../../frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.html#install-neuron-tensorflow). You can select the Kernel from the “Kernel -> Change Kernel” option on the top of this Jupyter notebook page.\n" ] }, { "cell_type": "markdown", "id": "04ae0838", "metadata": {}, "source": [ "## Acknowledgement:\n", "\n", "Many thanks to https://github.com/ildoonet for providing the pretrained model as well as the image preprocessing/pose estimation infrastructure." 
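, "\n", "Before compiling, it can also help to confirm that the downloaded graph_opt.pb parses and to locate its input and output ops. A minimal sketch (run it after the download cell below; the op names shown are the ones this tutorial expects):\n", "\n", "```python\n", "import tensorflow as tf\n", "\n", "gd = tf.GraphDef()\n", "with open('graph_opt.pb', 'rb') as f:\n", "    gd.ParseFromString(f.read())\n", "print([n.name for n in gd.node if n.op == 'Placeholder'])  # expect ['image']\n", "print(gd.node[-1].name)  # usually the last node: 'Openpose/concat_stage7'\n", "```\n"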
] }, { "cell_type": "markdown", "id": "d0d6d08e", "metadata": {}, "source": [ "## Download tensorflow pose net frozen graph." ] }, { "cell_type": "code", "execution_count": null, "id": "1926d4e3", "metadata": { "scrolled": false }, "outputs": [], "source": [ "!wget -c --tries=2 $( wget -q -O - http://www.mediafire.com/file/qlzzr20mpocnpa3/graph_opt.pb | grep -o 'http*://download[^\"]*' | tail -n 1 ) -O graph_opt.pb\n", "\n", "!pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/\n", "!pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "markdown", "id": "83eb578b", "metadata": {}, "source": [ "## Compile\n", "Compile the pose net frozen graph into AWS Neuron compatible form. Network input image resolution is adjustable with argument --net_resolution (e. g., --net_resolution=656x368). The compiled model can accept arbitrary batch size input at runtime." ] }, { "cell_type": "code", "execution_count": null, "id": "362f322e", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Usage: python convert_graph_opt.py /path/to/graph_opt.pb /path/to/graph_opt_neuron.pb\n", "\"\"\"\n", "#import argparse\n", "import numpy as np\n", "import tensorflow as tf\n", "from tensorflow.core.framework.tensor_shape_pb2 import TensorShapeProto\n", "import tensorflow.neuron as tfn\n", "\n", "\n", "def compile():\n", " #parser = argparse.ArgumentParser()\n", " #parser.add_argument('input_pb_path', help='Input serialized GraphDef protobuf')\n", " #parser.add_argument('output_pb_path', help='Ouput serialized GraphDef protobuf')\n", " #parser.add_argument('--net_resolution', default='656x368', help='Network resolution in WxH format, e. g., --net_resolution=656x368')\n", " #parser.add_argument('--debug_verify', action='store_true')\n", " #args = parser.parse_args()\n", " \n", " input_pb_path = './graph_opt.pb'\n", " net_resolution = '656x368'\n", " output_pb_path = './graph_opt_neuron_' + net_resolution + '.pb'\n", " \n", " debug_verify = 'store_true'\n", " dim_w, dim_h = net_resolution.split('x')\n", " dim_w = int(dim_w)\n", " dim_h = int(dim_h)\n", " graph_def = tf.GraphDef()\n", " with open(input_pb_path, 'rb') as f:\n", " graph_def.ParseFromString(f.read())\n", "\n", " if debug_verify:\n", " np.random.seed(0)\n", " feed_dict = {'image:0': np.random.rand(1, dim_h, dim_w, 3)}\n", " output_name = 'Openpose/concat_stage7:0'\n", " with tf.Session(graph=tf.Graph()) as sess:\n", " tf.import_graph_def(graph_def, name='')\n", " result_reference = sess.run(output_name, feed_dict)\n", "\n", " preprocessing_ops = {'preprocess_divide', 'preprocess_divide/y', 'preprocess_subtract', 'preprocess_subtract/y'}\n", " graph_def = nhwc_to_nchw(graph_def, preprocessing_ops)\n", " graph_def = inline_float32_to_float16(graph_def, preprocessing_ops)\n", " with tf.Session(graph=tf.Graph()) as sess:\n", " tf.import_graph_def(graph_def, name='')\n", " no_fuse_ops = preprocessing_ops.union({'Openpose/concat_stage7'})\n", " infer_graph = tfn.graph_util.inference_graph_from_session(\n", " sess, shape_feed_dict={'image:0': [1, dim_h, dim_w, 3]}, output_tensors=['Openpose/concat_stage7:0'],\n", " no_fuse_ops=no_fuse_ops, dynamic_batch_size=True,\n", " )\n", " with open(output_pb_path, 'wb') as f:\n", " f.write(infer_graph.as_graph_def().SerializeToString())\n", "\n", " if debug_verify:\n", " with tf.Session(graph=infer_graph) as sess:\n", " result_compiled = sess.run(output_name, feed_dict)\n", " 
np.testing.assert_allclose(result_compiled, result_reference, rtol=1e-2, atol=1e-3)\n", "\n", "\n", "def inline_float32_to_float16(graph_def, preprocessing_ops):\n", " float32_enum = tf.float32.as_datatype_enum\n", " float16_enum = tf.float16.as_datatype_enum\n", " graph = tf.Graph()\n", " with graph.as_default():\n", " tf.import_graph_def(graph_def, name='')\n", " graph_def = graph.as_graph_def()\n", " for node in graph_def.node:\n", " if node.name in preprocessing_ops or node.op == 'Placeholder':\n", " cast_input_node_name = node.name\n", " continue\n", " if node.op == 'Const':\n", " if node.attr['dtype'].type == float32_enum:\n", " node.attr['dtype'].type = float16_enum\n", " tensor_def = node.attr['value'].tensor\n", " tensor_def.dtype = float16_enum\n", " if tensor_def.tensor_content:\n", " const_np = np.frombuffer(tensor_def.tensor_content, dtype=np.float32).astype(np.float16)\n", " tensor_def.tensor_content = const_np.tobytes()\n", " elif len(tensor_def.float_val):\n", " const_np = np.array(tensor_def.float_val).astype(np.float16).view(np.uint16)\n", " tensor_def.float_val[:] = []\n", " tensor_def.half_val[:] = list(const_np)\n", " else:\n", " raise NotImplementedError\n", " elif 'T' in node.attr and node.attr['T'].type == float32_enum:\n", " node.attr['T'].type = float16_enum\n", " for node in graph_def.node:\n", " if node.name == cast_input_node_name:\n", " node.name = '{}_PreCastFloat32ToFlot16'.format(node.name)\n", " input_node = node\n", " break\n", " cast_input_node = _gen_cast_node_def(cast_input_node_name, tf.float16, input_node)\n", "\n", " output_node = graph_def.node[-1]\n", " cast_output_node_name = output_node.name\n", " output_node.name = '{}_PreCastFloat16ToFlot32'.format(output_node.name)\n", " cast_output_node = _gen_cast_node_def(cast_output_node_name, tf.float32, output_node)\n", "\n", " preprocessing_ops.add(input_node.name)\n", " new_graph_def = tf.GraphDef()\n", " new_graph_def.node.extend(graph_def.node)\n", " new_graph_def.node.append(cast_input_node)\n", " new_graph_def.node.append(cast_output_node)\n", " graph = tf.Graph()\n", " with graph.as_default():\n", " tf.import_graph_def(new_graph_def, name='')\n", " return graph.as_graph_def()\n", "\n", "\n", "def nhwc_to_nchw(graph_def, preprocessing_ops):\n", " graph = tf.Graph()\n", " with graph.as_default():\n", " tf.import_graph_def(graph_def, name='')\n", " graph_def = graph.as_graph_def()\n", " node_name_to_node = {node.name: node for node in graph_def.node}\n", " for node in graph_def.node:\n", " if node.name in preprocessing_ops or node.op == 'Placeholder':\n", " transpose_input_node_name = node.name\n", " continue\n", " if node.op == 'Conv2D':\n", " node.attr['data_format'].s = b'NCHW'\n", " strides = node.attr['strides'].list.i\n", " strides[:] = [strides[0], strides[3], strides[1], strides[2]]\n", " elif node.op == 'BiasAdd':\n", " if node.name != 'probs/BiasAdd':\n", " node.attr['data_format'].s = b'NCHW'\n", " elif node.op == 'MaxPool':\n", " node.attr['data_format'].s = b'NCHW'\n", " ksize = node.attr['ksize'].list.i\n", " ksize[:] = [ksize[0], ksize[3], ksize[1], ksize[2]]\n", " strides = node.attr['strides'].list.i\n", " strides[:] = [strides[0], strides[3], strides[1], strides[2]]\n", " elif node.op in {'Concat', 'ConcatV2'}:\n", " node_axes = node_name_to_node[node.input[-1]]\n", " node_axes.attr['value'].tensor.int_val[:] = [1]\n", " for node in graph_def.node:\n", " if node.name == transpose_input_node_name:\n", " node.name = '{}_PreTransposeNHWC2NCHW'.format(node.name)\n", " input_node = 
node\n", " break\n", " transpose_input_node, transpose_input_perm_node = _gen_transpose_def(transpose_input_node_name, [0, 3, 1, 2], input_node)\n", "\n", " output_node = graph_def.node[-1]\n", " transpose_output_node_name = output_node.name\n", " output_node.name = '{}_PreTransposeNCHW2NHWC'.format(output_node.name)\n", " transpose_output_node, transpose_output_perm_node = _gen_transpose_def(transpose_output_node_name, [0, 2, 3, 1], output_node)\n", "\n", " preprocessing_ops.add(input_node.name)\n", " preprocessing_ops.add(transpose_input_perm_node.name)\n", " new_graph_def = tf.GraphDef()\n", " new_graph_def.node.extend(graph_def.node)\n", " new_graph_def.node.append(transpose_input_perm_node)\n", " new_graph_def.node.append(transpose_input_node)\n", " new_graph_def.node.append(transpose_output_perm_node)\n", " new_graph_def.node.append(transpose_output_node)\n", " graph = tf.Graph()\n", " with graph.as_default():\n", " tf.import_graph_def(new_graph_def, name='')\n", " return graph.as_graph_def()\n", "\n", "\n", "def _gen_cast_node_def(name, target_dtype, input_node):\n", " cast_node = tf.NodeDef(name=name, op='Cast')\n", " cast_node.input.append(input_node.name)\n", " cast_node.attr['DstT'].type = target_dtype.as_datatype_enum\n", " cast_node.attr['SrcT'].type = input_node.attr['T'].type\n", " cast_node.attr['Truncate'].b = False\n", " return cast_node\n", "\n", "\n", "def _gen_transpose_def(name, perm, input_node):\n", " perm_node = tf.NodeDef(name='{}/perm'.format(name), op='Const')\n", " perm_node.attr['dtype'].type = tf.int32.as_datatype_enum\n", " tensor_def = perm_node.attr['value'].tensor\n", " tensor_def.dtype = tf.int32.as_datatype_enum\n", " tensor_def.tensor_shape.dim.append(TensorShapeProto.Dim(size=4))\n", " tensor_def.tensor_content = np.array(perm, dtype=np.int32).tobytes()\n", " transpose_node = tf.NodeDef(name=name, op='Transpose')\n", " transpose_node.input.append(input_node.name)\n", " transpose_node.input.append(perm_node.name)\n", " transpose_node.attr['T'].type = input_node.attr['T'].type\n", " transpose_node.attr['Tperm'].type = tf.int32.as_datatype_enum\n", " return transpose_node, perm_node\n" ] }, { "cell_type": "code", "execution_count": null, "id": "88c41e01", "metadata": { "scrolled": true }, "outputs": [], "source": [ "compile()\n", "\n", "# Sample output will look like below:\n", "# WARNING:tensorflow:From :47: inference_graph_from_session (from tensorflow_neuron.python.graph_util) is deprecated and will be removed in a future version.\n", "# Instructions for updating:\n", "# Please refer to AWS documentation on Neuron integrated TensorFlow 2.0.\n", "# INFO:tensorflow:Froze 0 variables.\n", "# INFO:tensorflow:Converted 0 variables to const ops.\n", "# INFO:tensorflow:fusing subgraph {subgraph neuron_op_ed41d2deb8c54255 with input tensors [\"\"], output tensors [\"\"]} with neuron-cc\n", "# INFO:tensorflow:Number of operations in TensorFlow session: 474\n", "# INFO:tensorflow:Number of operations after tf.neuron optimizations: 474\n", "# INFO:tensorflow:Number of operations placed on Neuron runtime: 465" ] }, { "cell_type": "markdown", "id": "5a9af0c7", "metadata": {}, "source": [ "## Deploy\n", "Using same instance to deploy the model.\n", "In case of different deployment instance, launch a deployment inf1 instance and copy the AWS Neuron optimized tensorflow frozen graph graph_opt_neuron_656x368.pb to the deployment inf1 instance. 
The smallest instance type inf1.xlarge is sufficient for this demo.\n", "\n", "Your graph_opt_neuron_656x368.pb can now be plugged into https://github.com/ildoonet seamlessly if you have tensorflow-neuron installed. When it is used at runtime, please ensure that the image resolution is the same as the compile-time image resolution, i.e., 656x368.\n", "\n", "Measure performance on the compiled frozen graph using dummy inputs.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0481d049", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Copyright (C) 2020, Amazon.com. All Rights Reserved\n", "\"\"\"\n", "import os\n", "import atexit\n", "import time\n", "import math\n", "import json\n", "from collections import OrderedDict, Counter\n", "from contextlib import contextmanager, ContextDecorator\n", "from functools import wraps\n", "from tensorflow.python.client import session\n", "from tensorflow.python.platform import tf_logging as logging\n", "\n", "\n", "class measure_performance(ContextDecorator):\n", " \"\"\"Convenient tool for performance measurements.\n", " Can be applied to tensorflow session.run, tf-serving unary gRPC calls, or a given custom function.\n", " Usage:\n", " To generate performance report for the entire Python or gRPC-client process, insert\n", " the following function call before running inferences:\n", " `tfn.measure_performance()`\n", " Then latency/throughput report will be generated when the process terminates.\n", " Alternatively, it is possible to use `tfn.measure_performance` programmatically\n", " as a context manager. Performance measurement will be done for all inferences\n", " happening under this context. Report will be displayed as INFO level log when exiting\n", " the context. It is also possible to obtain a JSON format report in Python.\n", " For example:\n", " ```\n", " with tfn.measure_performance() as perf:\n", " ... 
(run some inferences) ...\n", " report_json = perf.report()\n", " report_full_json = perf.report(verbosity=1)\n", " ```\n", " \"\"\"\n", "\n", " def __init__(self, func=None, window_size=1):\n", " self.perf_tracker = PerformanceTracker(window_size)\n", " atexit.register(self.perf_tracker.report)\n", " self._original_run = session.Session.run\n", " self._original_grpc_call = None\n", " if callable(func):\n", " self.perf_tracker.register_func(self._track_performance(func))\n", " else:\n", " session.Session.run = self._track_performance(session.Session.run)\n", " try:\n", " import grpc\n", " from tensorflow_serving.apis import prediction_service_pb2_grpc\n", " dummy_stub = prediction_service_pb2_grpc.PredictionServiceStub(grpc.insecure_channel(''))\n", " self._grpc_callable_type = type(dummy_stub.Predict)\n", " self._original_grpc_call = self._grpc_callable_type.__call__\n", " except ImportError:\n", " pass\n", " if callable(self._original_grpc_call):\n", " self._grpc_callable_type.__call__ = self._track_performance(\n", " grpc._channel._UnaryUnaryMultiCallable.__call__\n", " )\n", "\n", " def __enter__(self):\n", " return self.perf_tracker\n", "\n", " def __exit__(self, *exc):\n", " atexit.unregister(self.perf_tracker.report)\n", " self.perf_tracker.report()\n", " session.Session.run = self._original_run\n", " if self._original_grpc_call is not None:\n", " self._grpc_callable_type.__call__ = self._original_grpc_call\n", " return False\n", "\n", " def _track_performance(self, func):\n", " @wraps(func)\n", " def wrapper(*args, **kwargs):\n", " start = time.time()\n", " result = func(*args, **kwargs)\n", " end = time.time()\n", " self.perf_tracker.add_timestamps(start, end)\n", " return result\n", " return wrapper\n", "\n", "\n", "class PerformanceTracker(ContextDecorator):\n", "\n", " description = (\n", " \"Latency unit: second. Throughput unit: number of batched inferences per second. \"\n", " \"Reported throughput is a lower bound of the actual throughput as inferences \"\n", " \"spanning across window boundaries are not counted towards any of the windows. \"\n", " \"'Quiet' periods (i. 
e., window buckets where the inference function is not called) \"\n", " \"are not counted towards the reported average throughput.\"\n", " )\n", "\n", " def __init__(self, window_size):\n", " self.window_size = window_size\n", " self.timestamps_list = []\n", " self._func = None\n", "\n", " def __call__(self, *args, **kwargs):\n", " return self._func(*args, **kwargs)\n", "\n", " def register_func(self, func):\n", " self._func = func\n", "\n", " def add_timestamps(self, start, end):\n", " self.timestamps_list.append([start, end])\n", "\n", " def report(self, verbosity=0):\n", " if self.timestamps_list:\n", " latency_list = [end - start for start, end in self.timestamps_list]\n", " latency_json = {\n", " 'p50': percentile(latency_list, 50),\n", " 'p90': percentile(latency_list, 90),\n", " 'p99': percentile(latency_list, 99),\n", " 'p100': percentile(latency_list, 100),\n", " }\n", " bucketed_timestamps = [self._get_bucket(start, end) for start, end in self.timestamps_list]\n", " counted_buckets = Counter(item for item in bucketed_timestamps if item is not None)\n", " bucket_throughputs = [(key, value / self.window_size) for key, value in sorted(counted_buckets.items())]\n", " busy_throughputs = list(OrderedDict((key, value) for key, value in bucket_throughputs).values())\n", " throughput_json = {\n", " 'peak': max(busy_throughputs),\n", " 'median': percentile(busy_throughputs, 50),\n", " 'average': sum(busy_throughputs) / len(busy_throughputs),\n", " }\n", " if verbosity > 0:\n", " throughput_json['trend'] = busy_throughputs\n", " report_json = {\n", " 'pid': os.getpid(),\n", " 'throughput': throughput_json,\n", " 'latency': latency_json,\n", " 'description': PerformanceTracker.description,\n", " }\n", " with _logging_show_info():\n", " logging.info('performance report:\\n{}'.format(json.dumps(report_json, indent=4)))\n", " return report_json\n", "\n", " def _get_bucket(self, start, end):\n", " bucketed_start = math.floor(start / self.window_size) * self.window_size\n", " bucketed_end = math.ceil(end / self.window_size) * self.window_size\n", " if bucketed_end - bucketed_start == self.window_size:\n", " return bucketed_start\n", " else:\n", " return None\n", "\n", "\n", "def percentile(number_list, percent):\n", " pos_float = len(number_list) * percent / 100\n", " max_pos = len(number_list) - 1\n", " pos_floor = min(math.floor(pos_float), max_pos)\n", " pos_ceil = min(math.ceil(pos_float), max_pos)\n", " number_list = sorted(number_list)\n", " return number_list[pos_ceil] if pos_float - pos_floor > 0.5 else number_list[pos_floor]\n", "\n", "\n", "@contextmanager\n", "def _logging_show_info():\n", " try:\n", " verbosity = logging.get_verbosity()\n", " logging.set_verbosity(logging.INFO)\n", " yield\n", " finally:\n", " logging.set_verbosity(verbosity)" ] }, { "cell_type": "code", "execution_count": null, "id": "960c6aa9", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Below are the inputs for compiled frozen graph \n", "\n", "pb_path is a /path/graph_opt_neuron_656x368.pb\n", "num_thread = 8 ( Number of threads that work on each tensorflow session ) \n", "batch_size =1 ( batch_size )\n", "net_resolution ,default=656x368\n", "num_inferences = 200\n", "\"\"\"\n", "import os\n", "from concurrent import futures\n", "import numpy as np\n", "import tensorflow as tf\n", "import tensorflow.neuron as tfn\n", "\n", "def run_with_dummy(sess, dummy_feed_dict, num_inferences):\n", " for _ in range(num_inferences):\n", " sess.run('Openpose/concat_stage7:0', dummy_feed_dict)\n", " \n", "def 
main():\n", " NUM_NEURON_CORES = 16\n", " pb_path = './graph_opt_neuron_656x368.pb'\n", " num_thread = 8\n", " batch_size = 1\n", " net_resolution = '656x368'\n", " num_inferences = 200\n", " dim_w, dim_h = net_resolution.split('x')\n", " dim_w = int(dim_w)\n", " dim_h = int(dim_h)\n", " graph_def = tf.GraphDef()\n", " with open(pb_path, 'rb') as f:\n", " graph_def.ParseFromString(f.read())\n", " \n", " graph_def = tfn.graph_util.tag_multicore(graph_def, NUM_NEURON_CORES)\n", " \n", " with tfn.measure_performance() as perf:\n", " with tf.Session(graph=tf.Graph()) as sess:\n", " tf.import_graph_def(graph_def, name='')\n", " input_name = 'image:0'\n", " input_shape = sess.graph.get_tensor_by_name(input_name).shape.as_list()\n", " input_shape[0] = batch_size\n", " input_shape[1] = dim_h\n", " input_shape[2] = dim_w\n", " dummy_feed_dict = {input_name: np.zeros(input_shape).astype(np.float32)}\n", " with futures.ThreadPoolExecutor(max_workers=num_thread) as executor:\n", " fut_list = [executor.submit(run_with_dummy, sess, dummy_feed_dict, num_inferences) for _ in range(num_thread)]\n", " res_list = [fut.result() for fut in fut_list] \n", "\n", "main()\n", "\n", "# Sample output will look like below:\n", "# INFO:tensorflow:performance report:\n", "# {\n", "# \"pid\": 17713,\n", "# \"throughput\": {\n", "# \"peak\": 66.0,\n", "# \"median\": 64.0,\n", "# \"average\": 61.56521739130435\n", "# },\n", "# \"latency\": {\n", "# \"p50\": 0.1106414794921875,\n", "# \"p90\": 0.11212301254272461,\n", "# \"p99\": 0.11337876319885254,\n", "# \"p100\": 7.08282732963562\n", "# },\n", "# \"description\": \"Latency unit: second. Throughput unit: number of batched inferences per second. Reported throughput is a lower bound of the actual throughput as inferences spanning across window boundaries are not counted towards any of the windows. 'Quiet' periods (i. e., window buckets where the inference function is not called) are not counted towards the reported average throughput.\"\n", "# }" ] }, { "cell_type": "raw", "id": "4f15e776", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: src/examples/tensorflow/ssd300_demo/README.md ================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)** ================================================ FILE: src/examples/tensorflow/ssd300_demo/ssd300_detection.py ================================================ import argparse import json import pkg_resources from distutils.version import LooseVersion import numpy as np from PIL import Image import matplotlib.pyplot as plt import matplotlib.patches as patches import tensorflow as tf import tensorflow.neuron as tfn def main(): parser = argparse.ArgumentParser() parser.add_argument('--image', required=True, help='Path to image that is to be detected. Support jpeg and png format.') parser.add_argument('--image_with_detections', required=True, help='Path to save image after detection (with bounding boxes drawn). Png format.') parser.add_argument('--saved_model', required=True, help='TensorFlow SSD300 SavedModel') parser.add_argument('--score_threshold', type=float, default=0.15, help='Minimum required score for drawing a bounding box') parser.add_argument('--instances_val2017_json', default=None, help='Json file that contains labeling information') parser.add_argument('--save_results', default=None) parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if not args.disable_version_check: tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.0.1.0.1333.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. Please upgrade ' 'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) with open(args.image, 'rb') as f: img_jpg_bytes = f.read() model_feed_dict = {'batch_image': [img_jpg_bytes]} predictor = tf.contrib.predictor.from_saved_model(args.saved_model) results = predictor(model_feed_dict) if args.save_results is not None: np.savez(args.save_results, **results) boxes_np = results['boxes'] scores_np = results['scores'] classes_np = results['classes'] if args.instances_val2017_json is not None: with open(args.instances_val2017_json) as f: annotate_json = json.load(f) label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])} plt.switch_backend('agg') fig, ax = plt.subplots(1) ax.imshow(Image.open(args.image).convert('RGB')) wanted = scores_np[0] > args.score_threshold for xywh, label_no_bg in zip(boxes_np[0][wanted], classes_np[0][wanted]): rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none') ax.add_patch(rect) rx, ry = rect.get_xy() rx = rx + rect.get_width() / 2.0 if args.instances_val2017_json is not None: ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10, ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5)) plt.savefig(args.image_with_detections) plt.close(fig) if __name__ == '__main__': main() ================================================ FILE: src/examples/tensorflow/ssd300_demo/ssd300_evaluation.py ================================================ import argparse import os import json import glob from concurrent import futures import time import pkg_resources from distutils.version import LooseVersion import numpy as np import tensorflow as tf import tensorflow.neuron as tfn from pycocotools.cocoeval import COCOeval from DeepLearningExamples.PyTorch.Detection.SSD.src.coco import COCO from 
DeepLearningExamples.PyTorch.Detection.SSD.src.utils import dboxes300_coco from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import SSDTransformer from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import COCODetection def get_val_dataset(val_annotate, val_coco_root): dboxes = dboxes300_coco() val_trans = SSDTransformer(dboxes, (300, 300), val=True) val_coco = COCODetection(val_coco_root, val_annotate, val_trans) return val_coco def main(): parser = argparse.ArgumentParser() parser.add_argument('--saved_model', required=True, help='TensorFlow SSD300 SavedModel') parser.add_argument('--val2017', required=True, help='Path to COCO 2017 validation dataset') parser.add_argument('--instances_val2017_json', required=True, help='Json file that contains labeling information') parser.add_argument('--num_sessions', type=int, default=1, help='Number of tensorflow sessions') parser.add_argument('--num_threads', type=int, default=4, help='Number of threads') parser.add_argument('--throughput_interval', type=int, default=10, help='Interval for counting throughput') parser.add_argument('--save_results', default=None) parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if not args.disable_version_check: tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.0.1.0.1333.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. Please upgrade ' 'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) predictor_list = [tf.contrib.predictor.from_saved_model(args.saved_model) for _ in range(args.num_sessions)] val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017) inv_map = {v: k for k, v in val_dataset.label_map.items()} model_feed_dict_list = [] for img_id in val_dataset.img_keys: img_path = os.path.join(args.val2017, val_dataset.images[img_id][0]) with open(img_path, 'rb') as f: img_jpg_bytes = f.read() model_feed_dict_list.append({'batch_image': [img_jpg_bytes]}) latency_list = [] throughput_list = [] def predict(pred, model_feed_dict): start = time.time() result = pred(model_feed_dict) latency_list.append(time.time() - start) return result def performance(): last_num_infer = len(latency_list) while len(latency_list) < len(model_feed_dict_list): current_num_infer = len(latency_list) throughput = (current_num_infer - last_num_infer) / args.throughput_interval throughput_list.append(throughput) p50 = 0.0 p90 = 0.0 if latency_list: p50 = np.percentile(latency_list, 50) p90 = np.percentile(latency_list, 90) print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90)) last_num_infer = current_num_infer time.sleep(args.throughput_interval) executor = futures.ThreadPoolExecutor(max_workers=(args.num_sessions*args.num_threads)+1) performance_future = executor.submit(performance) eval_futures = [] for idx, model_feed_dict in enumerate(model_feed_dict_list): eval_fut = executor.submit(predict, predictor_list[idx%len(predictor_list)], model_feed_dict) eval_futures.append(eval_fut) waited_results = [] for idx, eval_fut in enumerate(eval_futures): if idx % 100 == 0: print('evaluating image {}/{}'.format(idx, len(eval_futures))) waited_results.append(eval_fut.result()) eval_results = [] for idx, (img_id, results) in enumerate(zip(val_dataset.img_keys, waited_results)): boxes = results['boxes'] for box, label, prob in zip(results['boxes'][0], 
results['classes'][0], results['scores'][0]): res = [img_id, box[0], box[1], box[2], box[3], prob, inv_map[label+1]] # +1 to account for background eval_results.append(res) performance_future.result() coco_gt = COCO(annotation_file=args.instances_val2017_json) coco_dt = coco_gt.loadRes(np.array(eval_results).astype(np.float32)) coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox') coco_eval.evaluate() coco_eval.accumulate() coco_eval.summarize() if args.save_results is not None: np.save(args.save_results, coco_eval.stats) if __name__ == '__main__': main() ================================================ FILE: src/examples/tensorflow/ssd300_demo/ssd300_evaluation_client.py ================================================ import argparse import os import json import glob from concurrent import futures import time import subprocess from distutils.version import LooseVersion import numpy as np import tensorflow as tf import grpc from tensorflow_serving.apis import predict_pb2 from tensorflow_serving.apis import prediction_service_pb2_grpc from pycocotools.cocoeval import COCOeval from DeepLearningExamples.PyTorch.Detection.SSD.src.coco import COCO from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import dboxes300_coco from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import SSDTransformer from DeepLearningExamples.PyTorch.Detection.SSD.src.utils import COCODetection def get_val_dataset(val_annotate, val_coco_root): dboxes = dboxes300_coco() val_trans = SSDTransformer(dboxes, (300, 300), val=True) val_coco = COCODetection(val_coco_root, val_annotate, val_trans) return val_coco def main(): parser = argparse.ArgumentParser() parser.add_argument('--server_address', default='localhost:8500', help='tensorflow-model-server-neuron grpc address') parser.add_argument('--model_name', default='default', help='Serving model name') parser.add_argument('--val2017', required=True, help='Path to COCO 2017 validation dataset') parser.add_argument('--instances_val2017_json', required=True, help='Json file that contains labeling information') parser.add_argument('--num_threads', type=int, default=4, help='Number of threads') parser.add_argument('--throughput_interval', type=int, default=10, help='Interval for counting throughput') parser.add_argument('--save_results', default=None) args = parser.parse_args() channel = grpc.insecure_channel(args.server_address) stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017) inv_map = {v: k for k, v in val_dataset.label_map.items()} request_list = [] for img_id in val_dataset.img_keys: img_path = os.path.join(args.val2017, val_dataset.images[img_id][0]) with open(img_path, 'rb') as f: img_jpg_bytes = f.read() data = np.array([img_jpg_bytes], dtype=object) data = tf.contrib.util.make_tensor_proto(data, shape=data.shape) request = predict_pb2.PredictRequest() request.model_spec.name = args.model_name request.inputs['batch_image'].CopyFrom(data) request_list.append(request) latency_list = [] throughput_list = [] def predict(request): start = time.time() result = stub.Predict(request).outputs latency_list.append(time.time() - start) return result def performance(): last_num_infer = len(latency_list) while len(latency_list) < len(request_list): current_num_infer = len(latency_list) throughput = (current_num_infer - last_num_infer) / args.throughput_interval throughput_list.append(throughput) p50 = 0.0 p90 = 0.0 if latency_list: p50 = np.percentile(latency_list, 50) p90 = 
np.percentile(latency_list, 90) print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90)) last_num_infer = current_num_infer time.sleep(args.throughput_interval) executor = futures.ThreadPoolExecutor(max_workers=args.num_threads+1) performance_future = executor.submit(performance) eval_futures = [] for idx, request in enumerate(request_list): eval_fut = executor.submit(predict, request) eval_futures.append(eval_fut) waited_results = [] for idx, eval_fut in enumerate(eval_futures): if idx % 100 == 0: print('evaluating image {}/{}'.format(idx, len(eval_futures))) waited_results.append(eval_fut.result()) eval_results = [] for idx, (img_id, results) in enumerate(zip(val_dataset.img_keys, waited_results)): results = {key: tf.make_ndarray(value) for key, value in results.items()} boxes = results['boxes'] for box, label, prob in zip(results['boxes'][0], results['classes'][0], results['scores'][0]): res = [img_id, box[0], box[1], box[2], box[3], prob, inv_map[label+1]] # +1 to account for background eval_results.append(res) performance_future.result() coco_gt = COCO(annotation_file=args.instances_val2017_json) coco_dt = coco_gt.loadRes(np.array(eval_results).astype(np.float32)) coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox') coco_eval.evaluate() coco_eval.accumulate() coco_eval.summarize() if args.save_results is not None: np.save(args.save_results, coco_eval.stats) if __name__ == '__main__': main() ================================================ FILE: src/examples/tensorflow/ssd300_demo/ssd300_model.py ================================================ import sys import os import argparse import time import itertools from functools import partial from collections import Counter import json import shutil import pkg_resources from distutils.version import LooseVersion import numpy as np import tensorflow as tf from tensorflow.core.framework import attr_value_pb2 import tensorflow.neuron as tfn import torch def decode_jpeg_resize(input_tensor, image_size): # decode jpeg tensor = tf.image.decode_png(input_tensor, channels=3) # resize decoded_shape = tf.shape(tensor) tensor = tf.cast(tensor, tf.float32) decoded_shape_hw = decoded_shape[0:2] decoded_shape_hw_float32 = tf.cast(decoded_shape_hw, tf.float32) tensor = tf.image.resize(tensor, image_size) # normalize tensor -= np.array([0.485, 0.456, 0.406]).astype(np.float32) * 255.0 return tensor, decoded_shape_hw_float32[::-1] def preprocessor(input_tensor, image_size): with tf.name_scope('Preprocessor'): tensor, bbox_scale_hw = tf.map_fn( partial(decode_jpeg_resize, image_size=image_size), input_tensor, dtype=(tf.float32, tf.float32), back_prop=False, parallel_iterations=16) return tensor, bbox_scale_hw def tf_Conv2d(input_tensor, module, first_conv=False): np_dtype = input_tensor.dtype.as_numpy_dtype kernel_np = module.weight.detach().numpy().transpose([2, 3, 1, 0]) if first_conv: kernel_np /= (np.array([0.229, 0.224, 0.225]).astype(np.float32) * 255.0)[:, np.newaxis] kernel = tf.constant(kernel_np.astype(np_dtype)) if any(module.padding): pad_h, pad_w = module.padding padding = [[0, 0], [pad_h, pad_h], [pad_w, pad_w], [0, 0]] input_tensor = tf.pad(input_tensor, padding) stride_h, stride_w = module.stride tensor = tf.nn.conv2d(input_tensor, kernel, strides=[1, stride_h, stride_w, 1], padding='VALID') if module.bias is not None: bias = tf.constant(module.bias.detach().numpy().astype(np_dtype)) tensor = tf.nn.bias_add(tensor, bias) return tensor def tf_BatchNorm2d(input_tensor, module): def 
_norm_np(ts): return ts.astype(input_tensor.dtype.as_numpy_dtype) mean = _norm_np(module.running_mean.detach().numpy()) offset = _norm_np(module.bias.detach().numpy()) inv_std = np.sqrt(module.running_var.detach().numpy() + module.eps) scale_inv_std = _norm_np(module.weight.detach().numpy() / inv_std) return scale_inv_std * (input_tensor - mean) + offset def tf_MaxPool2d(input_tensor, module): pad = module.padding tensor = tf.pad(input_tensor, [[0, 0], [pad, pad], [pad, pad], [0, 0]]) return tf.nn.max_pool2d(tensor, ksize=module.kernel_size, strides=module.stride, padding='VALID') def tf_Bottleneck(input_tensor, module): tensor = tf_Conv2d(input_tensor, module.conv1) tensor = tf_BatchNorm2d(tensor, module.bn1) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, module.conv2) tensor = tf_BatchNorm2d(tensor, module.bn2) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, module.conv3) tensor = tf_BatchNorm2d(tensor, module.bn3) if module.downsample is not None: input_tensor = tf_Conv2d(input_tensor, module.downsample[0]) input_tensor = tf_BatchNorm2d(input_tensor, module.downsample[1]) return tf.nn.relu(input_tensor + tensor) def tf_SequentialBottleneck(tensor, seq, resnet): with tf.name_scope('{}.Sequential'.format(seq)): for idx, module in enumerate(resnet[seq]): with tf.name_scope('{}.BasicBlock'.format(idx)): tensor = tf_Bottleneck(tensor, module) return tensor def tf_bbox_view(detection_feed, modules, ndim): results = [] for idx, (tensor, mod) in enumerate(zip(detection_feed, modules)): with tf.name_scope('branch{}'.format(idx)): tensor = tf_Conv2d(tensor, mod) tensor = tf.transpose(tensor, [0, 3, 1, 2]) tensor = tf.cast(tensor, tf.float32) shape = tensor.shape.as_list() batch_size = -1 if shape[0] is None else shape[0] new_shape = [batch_size, ndim, np.prod(shape[1:]) // ndim] results.append(tf.reshape(tensor, new_shape)) tensor = tf.concat(results, axis=-1) return tensor def tf_feature_extractor(input_tensor, resnet): with tf.name_scope('FeatureExtractor'): with tf.name_scope('0.Conv2d'): tensor = tf_Conv2d(input_tensor, resnet[0], first_conv=True) with tf.name_scope('1.BatchNorm2d'): tensor = tf_BatchNorm2d(tensor, resnet[1]) with tf.name_scope('2.ReLU'): tensor = tf.nn.relu(tensor) with tf.name_scope('3.MaxPool2d'): tensor = tf_MaxPool2d(tensor, resnet[3]) tensor = tf_SequentialBottleneck(tensor, 4, resnet) tensor = tf_SequentialBottleneck(tensor, 5, resnet) tensor = tf_SequentialBottleneck(tensor, 6, resnet) tensor = tf.cast(tensor, tf.float16) return tensor def tf_box_predictor(tensor, ssd300_torch): with tf.name_scope('BoxPredictor'): detection_feed = [tensor] for idx, block in enumerate(ssd300_torch.additional_blocks): with tf.name_scope('{}.Sequential'.format(idx)): tensor = tf_Conv2d(tensor, block[0]) tensor = tf_BatchNorm2d(tensor, block[1]) tensor = tf.nn.relu(tensor) tensor = tf_Conv2d(tensor, block[3]) tensor = tf_BatchNorm2d(tensor, block[4]) tensor = tf.nn.relu(tensor) detection_feed.append(tensor) with tf.name_scope('Boxes'): loc = tf_bbox_view(detection_feed, ssd300_torch.loc, ndim=4) with tf.name_scope('Probabilities'): conf = tf_bbox_view(detection_feed, ssd300_torch.conf, ndim=ssd300_torch.label_num) return loc, conf @tfn.fuse(batch_size=1, dynamic_batch_size=True) def tf_ssd300(input_tensor, ssd300_torch): with tf.name_scope('SSD300'): tensor = tf_feature_extractor(input_tensor, ssd300_torch.feature_extractor.feature_extractor) loc, conf = tf_box_predictor(tensor, ssd300_torch) return loc, conf def scale_back_batch(bboxes_in, scores_in, scale_xy, scale_wh, 
dboxes_xywh): """ Do scale and transform from xywh to ltrb suppose input Nx4xnum_bbox Nxlabel_numxnum_bbox """ with tf.name_scope('ScaleBackBatch'): bboxes_in = tf.transpose(bboxes_in, [0, 2, 1]) scores_in = tf.transpose(scores_in, [0, 2, 1]) bboxes_xy = bboxes_in[:, :, :2] bboxes_wh = bboxes_in[:, :, 2:] bboxes_xy *= scale_xy bboxes_wh *= scale_wh bboxes_xy = bboxes_xy * dboxes_xywh[:, :, 2:] + dboxes_xywh[:, :, :2] bboxes_wh = tf.exp(bboxes_wh) * dboxes_xywh[:, :, 2:] bboxes_wh_half = 0.5 * bboxes_wh bboxes_lt = bboxes_xy - bboxes_wh_half bboxes_rb = bboxes_xy + bboxes_wh_half bboxes_in = tf.concat([bboxes_lt, bboxes_rb], axis=-1) return bboxes_in, tf.nn.softmax(scores_in, axis=-1) def select_nms_outputs(input_tensors): boxes_xywh, scores, classes, valid_detections = input_tensors return boxes_xywh[:valid_detections], scores[:valid_detections], classes[:valid_detections] def postprocessor(ploc_ts, plabel_ts, bbox_scale_hw_ts, scale_xy, scale_wh, dboxes_xywh): with tf.name_scope('Postprocessor'): ploc_ts = tf.cast(ploc_ts, tf.float32) plabel_ts = tf.cast(plabel_ts, tf.float32) bboxes_ts, probs_ts = scale_back_batch(ploc_ts, plabel_ts, scale_xy, scale_wh, dboxes_xywh) bboxes_ts = bboxes_ts[:, :, tf.newaxis, :] probs_ts = probs_ts[:, :, 1:] nms_outputs = tf.image.combined_non_max_suppression( bboxes_ts, probs_ts, max_output_size_per_class=200, max_total_size=200, iou_threshold=0.5, score_threshold=0.05, pad_per_class=False, clip_boxes=False, name='CombinedNonMaxSuppression', ) nmsed_boxes_x0y0x1y1, nmsed_scores, nmsed_classes, valid_detections = nms_outputs nmsed_boxes_x0y0 = nmsed_boxes_x0y0x1y1[..., :2] nmsed_boxes_x1y1 = nmsed_boxes_x0y0x1y1[..., 2:] bbox_scale_hw_ts = bbox_scale_hw_ts[:, tf.newaxis, :] nmsed_boxes_xy = nmsed_boxes_x0y0 * bbox_scale_hw_ts nmsed_boxes_wh = (nmsed_boxes_x1y1 - nmsed_boxes_x0y0) * bbox_scale_hw_ts nmsed_boxes_xywh = tf.concat([nmsed_boxes_xy, nmsed_boxes_wh], axis=-1) nmsed_boxes_xywh, nmsed_scores, nmsed_classes = tf.map_fn( select_nms_outputs, (nmsed_boxes_xywh, nmsed_scores, nmsed_classes, valid_detections), dtype=(tf.float32, tf.float32, tf.float32), back_prop=False, parallel_iterations=16) return nmsed_boxes_xywh, nmsed_scores, nmsed_classes class DefaultBoxes(object): def __init__(self, fig_size, feat_size, steps, scales, aspect_ratios, scale_xy=0.1, scale_wh=0.2): self.feat_size = feat_size self.fig_size = fig_size self.scale_xy_ = scale_xy self.scale_wh_ = scale_wh # According to https://github.com/weiliu89/caffe # Calculation method slightly different from paper self.steps = steps self.scales = scales fk = fig_size/np.array(steps) self.aspect_ratios = aspect_ratios self.default_boxes = [] # size of feature and number of feature for idx, sfeat in enumerate(self.feat_size): sk1 = scales[idx]/fig_size sk2 = scales[idx+1]/fig_size sk3 = np.sqrt(sk1*sk2) all_sizes = [(sk1, sk1), (sk3, sk3)] for alpha in aspect_ratios[idx]: w, h = sk1*np.sqrt(alpha), sk1/np.sqrt(alpha) all_sizes.append((w, h)) all_sizes.append((h, w)) for w, h in all_sizes: for i, j in itertools.product(range(sfeat), repeat=2): cx, cy = (j+0.5)/fk[idx], (i+0.5)/fk[idx] self.default_boxes.append((cx, cy, w, h)) self.dboxes = np.array(self.default_boxes) self.dboxes = self.dboxes.clip(min=0, max=1) # For IoU calculation self.dboxes_ltrb = self.dboxes.copy() self.dboxes_ltrb[:, 0] = self.dboxes[:, 0] - 0.5 * self.dboxes[:, 2] self.dboxes_ltrb[:, 1] = self.dboxes[:, 1] - 0.5 * self.dboxes[:, 3] self.dboxes_ltrb[:, 2] = self.dboxes[:, 0] + 0.5 * self.dboxes[:, 2] self.dboxes_ltrb[:, 3] = 
self.dboxes[:, 1] + 0.5 * self.dboxes[:, 3] @property def scale_xy(self): return self.scale_xy_ @property def scale_wh(self): return self.scale_wh_ def __call__(self, order="ltrb"): if order == "ltrb": return self.dboxes_ltrb if order == "xywh": return self.dboxes def dboxes300_coco(): figsize = 300 feat_size = [38, 19, 10, 5, 3, 1] steps = [8, 16, 32, 64, 100, 300] # use the scales here: https://github.com/amdegroot/ssd.pytorch/blob/master/data/config.py scales = [21, 45, 99, 153, 207, 261, 315] aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]] dboxes = DefaultBoxes(figsize, feat_size, steps, scales, aspect_ratios) return dboxes def main(): parser = argparse.ArgumentParser() parser.add_argument('--torch_checkpoint', required=True, help='Path to PyTorch SSD300 model checkpoint') parser.add_argument('--output_saved_model', required=True, help='Output TensorFlow SavedModel that runs on Inferentia') parser.add_argument('--disable_version_check', action='store_true') args = parser.parse_args() if os.path.exists(args.output_saved_model): raise OSError('SavedModel dir {} already exists'.format(args.output_saved_model)) if not args.disable_version_check: neuroncc_version = LooseVersion(pkg_resources.get_distribution('neuron-cc').version) if neuroncc_version < LooseVersion('1.0.18000'): raise RuntimeError( 'neuron-cc version {} is too low for this demo. Please upgrade ' 'by "pip install -U neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version)) tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) if tfn_version < LooseVersion('1.15.3.1.0.1900.0'): raise RuntimeError( 'tensorflow-neuron version {} is too low for this demo. Please upgrade ' 'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version)) sys.path.append(os.getcwd()) from DeepLearningExamples.PyTorch.Detection.SSD.src import model as torch_ssd300_model ssd300_torch = torch_ssd300_model.SSD300() ckpt = torch.load(args.torch_checkpoint, map_location=torch.device('cpu')) ssd300_torch.load_state_dict(ckpt['model']) ssd300_torch.eval() input_tensor = tf.placeholder(tf.string, [None]) image_tensor, bbox_scale_hw_tensor = preprocessor(input_tensor, [300, 300]) dboxes = dboxes300_coco() dboxes_xywh = dboxes(order="xywh")[np.newaxis, ...] ploc_tensor, plabel_tensor = tf_ssd300(image_tensor, ssd300_torch) boxes_tensor, scores_tensor, classes_tensor = postprocessor( ploc_tensor, plabel_tensor, bbox_scale_hw_tensor, dboxes.scale_xy, dboxes.scale_wh, dboxes_xywh) outputs = { 'boxes': boxes_tensor, 'scores': scores_tensor, 'classes': classes_tensor, } sess = tf.Session() try: sess.run(outputs) except: pass for op in sess.graph.get_operations(): if op.type == 'NeuronOp': if not op.get_attr('executable'): raise AttributeError( 'Neuron executable (neff) is empty. 
Please check neuron-cc is installed and working properly ' '("pip install neuron-cc --force --extra-index-url=https://pip.repos.neuron.amazonaws.com" ' 'to force reinstall neuron-cc).') model_config = op.node_def.attr['model_config'].list if model_config.i: model_config.i[0] = 1 else: model_config.i.extend([1, 1, 1, 10]) op._set_attr('model_config', attr_value_pb2.AttrValue(list=model_config)) tf.saved_model.simple_save(sess, args.output_saved_model, {'batch_image': input_tensor}, outputs) if __name__ == '__main__': main()

================================================
FILE: src/examples/tensorflow/tensorflow-neuronx/tfneuronx-roberta-base-tutorial.ipynb
================================================

{ "cells": [ { "cell_type": "markdown", "id": "e91cf83b", "metadata": {}, "source": [ "# Running Huggingface Roberta-Base with TensorFlow-NeuronX" ] },
{ "cell_type": "markdown", "id": "71394e1e", "metadata": {}, "source": [ "This tutorial demonstrates how to compile the Huggingface roberta-base model and run inference on a trn1.2xlarge instance with \n", "```tensorflow-neuronx```. To compile larger models like roberta-large, please consider using an inf2 instance." ] },
{ "cell_type": "markdown", "id": "828ef9bd", "metadata": {}, "source": [ "## Setup" ] },
{ "cell_type": "markdown", "id": "5becc549", "metadata": {}, "source": [ "To run this tutorial please follow the instructions for [TensorFlow-NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuronx/setup/tensorflow-neuronx-install.html) and the [Jupyter Notebook Quickstart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) and set your kernel to \"Python (tensorflow-neuronx)\".\n", "\n", "Next, install some additional dependencies." ] },
{ "cell_type": "code", "execution_count": null, "id": "ee1a3b84", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings, making errors easier to detect\n", "!pip install transformers" ] },
{ "cell_type": "markdown", "id": "c301cfce", "metadata": {}, "source": [ "## Download From Huggingface and Compile for AWS-Neuron" ] },
{ "cell_type": "code", "execution_count": null, "id": "92e8050d", "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow_neuronx as tfnx\n", "from transformers import RobertaTokenizer, TFRobertaModel\n", "\n", "# Create a wrapper for the roberta model that will accept inputs as a list\n", "# instead of a dictionary. 
This will allow the compiled model to be saved\n", "# to disk with the model.save() function.\n", "class RobertaWrapper(tf.keras.Model):\n", "    def __init__(self, model):\n", "        super().__init__()\n", "        self.model = model\n", "    def __call__(self, example_inputs):\n", "        return self.model({'input_ids' : example_inputs[0], 'attention_mask' : example_inputs[1]})\n", "\n", "\n", "tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n", "model = RobertaWrapper(TFRobertaModel.from_pretrained('roberta-base'))\n", "\n", "batch_size = 16\n", "\n", "# create example inputs with a batch size of 16\n", "text = [\"Paris is the <mask> of France.\"] * batch_size\n", "encoded_input = tokenizer(text, return_tensors='tf', padding='max_length', max_length=64)\n", "\n", "# turn inputs into a list\n", "example_input = [encoded_input['input_ids'], encoded_input['attention_mask']]\n", "\n", "# compile\n", "model_neuron = tfnx.trace(model, example_input)\n", "\n", "print(\"Running on neuron:\", model_neuron(example_input))\n", "\n", "# save the model to disk to save recompilation time for next usage\n", "model_neuron.save('./roberta-neuron-b16')" ] },
{ "cell_type": "markdown", "id": "0f2e159a", "metadata": {}, "source": [ "## Run Basic Inference Benchmarking" ] },
{ "cell_type": "code", "execution_count": null, "id": "ccf22e74", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import concurrent.futures\n", "import time\n", "\n", "reloaded_neuron_model = tf.keras.models.load_model('./roberta-neuron-b16')\n", "print(\"Reloaded model running on neuron:\", reloaded_neuron_model(example_input))\n", "\n", "num_threads = 4\n", "num_inferences = 1000\n", "\n", "latency_list = []\n", "def inference_with_latency_calculation(example_input):\n", "    global latency_list\n", "    start = time.time()\n", "    result = reloaded_neuron_model(example_input)\n", "    end = time.time()\n", "    latency_list.append((end-start) * 1000)\n", "    return result\n", "\n", "start = time.time()\n", "with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:\n", "    futures = []\n", "    for i in range(num_inferences):\n", "        futures.append(executor.submit(inference_with_latency_calculation, example_input))\n", "    for future in concurrent.futures.as_completed(futures):\n", "        get_result = future.result()\n", "end = time.time()\n", "\n", "total_time = end - start\n", "\n", "print(f\"Throughput was {(num_inferences * batch_size)/total_time} samples per second.\")\n", "print(f\"Latency p50 was {np.percentile(latency_list, 50)} ms\")\n", "print(f\"Latency p90 was {np.percentile(latency_list, 90)} ms\")\n", "print(f\"Latency p99 was {np.percentile(latency_list, 99)} ms\")" ] }
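,
{ "cell_type": "markdown", "id": "b7d4c9aa", "metadata": {}, "source": [ "A Neuron-compiled graph executes with the fixed input shapes it was traced with, so any new batch must be tokenized to the same `(16, 64)` shape used above. The cell below is a minimal illustrative sketch of this (the sample sentence is arbitrary); it reuses the `tokenizer`, `batch_size`, and `reloaded_neuron_model` defined in the earlier cells." ] },
{ "cell_type": "code", "execution_count": null, "id": "b7d4c9ab", "metadata": {}, "outputs": [], "source": [ "# Pad a new batch of text to the traced shape (batch_size=16, seq_len=64).\n", "# Inputs with a different shape would not match the compiled graph.\n", "new_text = [\"The quick brown fox jumps over the lazy dog.\"] * batch_size\n", "new_encoded = tokenizer(new_text, return_tensors='tf', padding='max_length', max_length=64)\n", "new_input = [new_encoded['input_ids'], new_encoded['attention_mask']]\n", "print(\"New batch on neuron:\", reloaded_neuron_model(new_input))" ] }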
Introduction:" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Rb5rSpcZvYbX" }, "source": [ "In this tutorial we will compile and deploy ResNet50 model for Inferentia.\n", "In this tutorial we provide two main sections:\n", "1. Compile the ResNet50 model.\n", "2. Infer the same compiled model.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [Tensorflow Installation Guide](../../../../frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.html#install-neuron-tensorflow). You can select the Kernel from the “Kernel -> Change Kernel” option on the top of this Jupyter notebook page.\n", "\n", "Instructions of how to setup Neuron Tensorflow environment and run the tutorial as a Jupyter notebook are available in the [Tensorflow Quick Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuron/tutorials/tensorflow-tutorial-setup.html#tensorflow-tutorial-setup)\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "!pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/\n", "!pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "E8FhiMivhcYB" }, "source": [ "## Compile for Neuron\n", "\n", "A trained model must be compiled to Inferentia target before it can be deployed on Inferentia instances. In this step we compile the Keras ResNet50 model and export it as a SavedModel which is an interchange format for TensorFlow models.\n", "At the end of compilation, the compiled SavedModel is saved in resnet50_neuron local directory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import shutil\n", "import tensorflow as tf\n", "import tensorflow.neuron as tfn\n", "import tensorflow.compat.v1.keras as keras\n", "from tensorflow.keras.applications.resnet50 import ResNet50\n", "from tensorflow.keras.applications.resnet50 import preprocess_input\n", "\n", "# Create a workspace\n", "WORKSPACE = './ws_resnet50'\n", "os.makedirs(WORKSPACE, exist_ok=True)\n", "\n", "# Prepare export directory (old one removed)\n", "model_dir = os.path.join(WORKSPACE, 'resnet50')\n", "compiled_model_dir = os.path.join(WORKSPACE, 'resnet50_neuron')\n", "shutil.rmtree(model_dir, ignore_errors=True)\n", "shutil.rmtree(compiled_model_dir, ignore_errors=True)\n", "\n", "# Instantiate Keras ResNet50 model\n", "keras.backend.set_learning_phase(0)\n", "keras.backend.set_image_data_format('channels_last')\n", "\n", "model = ResNet50(weights='imagenet')\n", "\n", "# Export SavedModel\n", "tf.saved_model.simple_save(\n", " session = keras.backend.get_session(),\n", " export_dir = model_dir,\n", " inputs = {'input': model.inputs[0]},\n", " outputs = {'output': model.outputs[0]})\n", "\n", "# Compile using Neuron\n", "tfn.saved_model.compile(model_dir, compiled_model_dir)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "I52jQOyO8vAn" }, "source": [ "## Deploy on Inferentia\n", "\n", "Using same instance to deploy the model.\n", "In case of different deployment instance, launch a deployment inf1 instance and copy compiled model to the 
,
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "I52jQOyO8vAn" }, "source": [ "## Deploy on Inferentia\n", "\n", "This example deploys the model on the same instance that compiled it.\n", "In case of a different deployment instance, launch a deployment inf1 instance and copy the compiled model to that instance.\n", "\n", "Download the example image, and install the pillow module for inference on the deployment instance:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -O https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg\n", "!pip install pillow # Necessary for loading images" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### After downloading the example image, run the inference." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import numpy as np\n", "import tensorflow as tf\n", "from tensorflow.keras.preprocessing import image\n", "from tensorflow.keras.applications import resnet50\n", "\n", "tf.keras.backend.set_image_data_format('channels_last')\n", "\n", "# Create input from image\n", "img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))\n", "img_arr = image.img_to_array(img_sgl)\n", "img_arr2 = np.expand_dims(img_arr, axis=0)\n", "img_arr3 = resnet50.preprocess_input(img_arr2)\n", "\n", "# Load model\n", "COMPILED_MODEL_DIR = './ws_resnet50/resnet50_neuron/'\n", "predictor_inferentia = tf.contrib.predictor.from_saved_model(COMPILED_MODEL_DIR)\n", "\n", "# Run inference\n", "model_feed_dict = {'input': img_arr3}\n", "infa_rslts = predictor_inferentia(model_feed_dict)\n", "\n", "# Display results\n", "print(resnet50.decode_predictions(infa_rslts[\"output\"], top=5)[0])\n", "\n", "# Sample output will look like below:\n", "#[('n02123045', 'tabby', 0.68817204), ('n02127052', 'lynx', 0.12701613), ('n02123159', 'tiger_cat', 0.08736559), ('n02124075', 'Egyptian_cat', 0.063844085), ('n02128757', 'snow_leopard', 0.009240591)]" ] } ], "metadata": { "colab": { "default_view": {}, "name": "Untitled", "provenance": [], "version": "0.3.2", "views": {} }, "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 1 }

================================================
FILE: src/examples/tensorflow/tensorflow_serving_tutorial.rst
================================================

.. _tensorflow-serving-neuronrt-visible-cores:

Using NEURON_RT_VISIBLE_CORES with TensorFlow Serving
=====================================================

TensorFlow Serving allows customers to scale up inference workloads across a network. TensorFlow Neuron Serving uses the same API as normal TensorFlow Serving with two differences: (a) the saved model must be compiled for Inferentia and (b) the entry point is a different binary named ``tensorflow_model_server_neuron``. Follow the steps below to install the package using apt-get or yum. This will be pre-installed in a future release.

Install TensorFlow Model Server and Serving API
-----------------------------------------------

Follow the steps in the :ref:`install-neuron-tensorflow`. Then ensure you install using either apt-get or yum. If using TF 1.x, install the appropriate version (see above):

.. code:: bash

   sudo apt-get install tensorflow-model-server-neuron

or

.. code:: bash

   sudo dnf install tensorflow-model-server-neuron
Also, you will need the TensorFlow Serving API (use --no-deps to prevent installation of regular tensorflow). Depending on the version of TensorFlow you wish to use:

For Tensorflow 1.x:

.. code:: bash

   pip install --no-deps tensorflow_serving_api==1.15

For Tensorflow 2.x:

.. code:: bash

   pip install --no-deps tensorflow_serving_api

For the example image preprocessing using Keras preprocessing, the Python Imaging Library Pillow is required:

.. code:: bash

   pip install pillow

To work around h5py issue https://github.com/aws/aws-neuron-sdk/issues/220:

.. code:: bash

   pip install "h5py<3.0.0"

Export and Compile Saved Model
------------------------------

The following example shows graph construction followed by the addition of a Neuron compilation step before exporting to a saved model.

For Tensorflow 1.x:

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron

   tf.keras.backend.set_learning_phase(0)
   tf.keras.backend.set_image_data_format('channels_last')

   model = tf.keras.applications.ResNet50(weights='imagenet')
   sess = tf.keras.backend.get_session()
   inputs = {'input': model.inputs[0]}
   outputs = {'output': model.outputs[0]}

   # save the model using tf.saved_model.simple_save
   modeldir = "./resnet50/1"
   tf.saved_model.simple_save(sess, modeldir, inputs, outputs)

   # compile the model for Inferentia
   neuron_modeldir = "./resnet50_inf1/1"
   tf.neuron.saved_model.compile(modeldir, neuron_modeldir, batch_size=1)

For Tensorflow 2.x:

.. code:: python

   import tensorflow as tf
   import tensorflow.neuron as tfn
   import numpy as np

   tf.keras.backend.set_learning_phase(0)
   tf.keras.backend.set_image_data_format('channels_last')

   image_sizes = [224, 224]
   model = tf.keras.applications.ResNet50(weights='imagenet')
   example_inputs = tf.random.uniform([1, *image_sizes, 3], dtype=tf.float32)

   # compile the model for Inferentia
   model_neuron = tfn.trace(model, example_inputs)

   # run the traced model once to define the forward pass and allow for saving
   model_neuron(example_inputs)

   tf.keras.models.save_model(model_neuron, './resnet50_inf1/1')

Serving Saved Model
-------------------

You can now serve the saved model with the ``tensorflow_model_server_neuron`` binary. To utilize multiple NeuronCores, it is recommended to launch multiple TensorFlow model servers, each bound to its own NeuronCore and listening on its own gRPC port:

.. code:: bash

   # important to set this environment variable before launching model servers
   export NEURON_RT_VISIBLE_CORES=0
   tensorflow_model_server_neuron --model_name=resnet50_inf1 \
       --model_base_path=$(pwd)/resnet50_inf1/ --port=8500

   # then to run another server on a different NeuronCore, open another
   # window and run this, except this time set NEURON_RT_VISIBLE_CORES=1
   # and use a different port; you can keep doing this up to the number
   # of NeuronCores on your machine
   export NEURON_RT_VISIBLE_CORES=1
   tensorflow_model_server_neuron --model_name=resnet50_inf1 \
       --model_base_path=$(pwd)/resnet50_inf1/ --port=8501

The compiled model is staged in Inferentia DRAM by the server to prepare for inference.
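With several single-NeuronCore servers running, simple client-side load balancing is enough to keep all cores busy. The following is a minimal sketch (assuming two servers on ports 8500 and 8501, as launched above); it round-robins gRPC requests across one stub per server:

.. code:: python

   import itertools

   import grpc
   from tensorflow_serving.apis import prediction_service_pb2_grpc

   # One stub per model server; each server owns one NeuronCore.
   ports = [8500, 8501]
   stubs = [
       prediction_service_pb2_grpc.PredictionServiceStub(
           grpc.insecure_channel('localhost:{}'.format(port)))
       for port in ports
   ]
   round_robin = itertools.cycle(stubs)

   def predict(request):
       """Send a PredictRequest to the next server in round-robin order."""
       return next(round_robin).Predict(request)

Any ``PredictRequest``, such as the ones constructed in the client examples below, can then be submitted through ``predict``.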
Generate inference requests to the model server
-----------------------------------------------

Now run inferences via gRPC as shown in the following sample client code:

For Tensorflow 1.x:

.. code:: python

   import numpy as np
   import grpc
   import tensorflow as tf
   from tensorflow.keras.preprocessing import image
   from tensorflow.keras.applications.resnet50 import preprocess_input
   from tensorflow.keras.applications.resnet50 import decode_predictions
   from tensorflow_serving.apis import predict_pb2
   from tensorflow_serving.apis import prediction_service_pb2_grpc

   if __name__ == '__main__':
       channel = grpc.insecure_channel('localhost:8500')
       stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
       img_file = tf.keras.utils.get_file(
           "./kitten_small.jpg",
           "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
       img = image.load_img(img_file, target_size=(224, 224))
       img_array = preprocess_input(image.img_to_array(img)[None, ...])
       request = predict_pb2.PredictRequest()
       request.model_spec.name = 'resnet50_inf1'
       request.inputs['input'].CopyFrom(
           tf.contrib.util.make_tensor_proto(img_array, shape=img_array.shape))
       result = stub.Predict(request)
       prediction = tf.make_ndarray(result.outputs['output'])
       print(decode_predictions(prediction))

For Tensorflow 2.x:

.. code:: python

   import numpy as np
   import grpc
   import tensorflow as tf
   from tensorflow.keras.preprocessing import image
   from tensorflow.keras.applications.resnet50 import preprocess_input
   from tensorflow.keras.applications.resnet50 import decode_predictions
   from tensorflow_serving.apis import predict_pb2
   from tensorflow_serving.apis import prediction_service_pb2_grpc

   tf.keras.backend.set_image_data_format('channels_last')

   if __name__ == '__main__':
       channel = grpc.insecure_channel('localhost:8500')
       stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
       img_file = tf.keras.utils.get_file(
           "./kitten_small.jpg",
           "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
       img = image.load_img(img_file, target_size=(224, 224))
       img_array = preprocess_input(image.img_to_array(img)[None, ...])
       request = predict_pb2.PredictRequest()
       request.model_spec.name = 'resnet50_inf1'
       request.inputs['input_1'].CopyFrom(
           tf.make_tensor_proto(img_array, shape=img_array.shape))
       result = stub.Predict(request)
       prediction = tf.make_ndarray(result.outputs['output_1'])
       print(decode_predictions(prediction))

================================================
FILE: src/examples/tensorflow/yolo_v3_demo/yolo_v3.ipynb
================================================

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [Broken] Evaluate YOLO v3 on Inferentia\n", "## Note: this tutorial runs on tensorflow-neuron 1.x only" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This tutorial walks through compiling and evaluating the YOLO v3 model on Inferentia using the AWS Neuron SDK.\n", "\n", "\n", "In this tutorial we provide three main sections:\n", "\n", "1. Download Dataset and Generate Pretrained SavedModel\n", "\n", "2. Compile the YOLO v3 model.\n", "\n", "3. Deploy the compiled model.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [Tensorflow Installation Guide](../../../../frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.html#install-neuron-tensorflow). 
You can select the Kernel from the “Kernel -> Change Kernel” option on the top of this Jupyter notebook page.\n", "\n", "Instructions of how to setup Neuron Tensorflow environment and run the tutorial as a Jupyter notebook are available in the Tutorial main page [Tensorflow-YOLO_v3 Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuron/tutorials/yolo_v3_demo/yolo_v3_demo.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This demo requires the following pip packages:\n", "\n", "`pillow matplotlib pycocotools`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%pip install tensorflow_neuron==1.15.5.2.8.9.0 neuron_cc==1.13.5.0 requests pillow matplotlib pycocotools==2.0.1 numpy==1.18.2 torch~=1.5.0 --force \\\n", " --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Download Dataset and Generate Pretrained SavedModel\n", "### Download COCO 2017 validation dataset\n", "\n", "We start by downloading the COCO validation dataset, which we will use to validate our model. The COCO 2017 dataset is widely used for object-detection, segmentation and image captioning." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!curl -LO http://images.cocodataset.org/zips/val2017.zip\n", "!curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip\n", "!unzip -q val2017.zip\n", "!unzip annotations_trainval2017.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Generate YOLO v3 tensorflow SavedModel (pretrained on COCO 2017 dataset)\n", "\n", "Script yolo_v3_coco_saved_model.py will generate a tensorflow SavedModel using pretrained weights from https://github.com/YunYang1994/tensorflow-yolov3/releases/download/v1.0/yolov3_coco.tar.gz." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run yolo_v3_coco_saved_model.py ./yolo_v3_coco_saved_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tensorflow SavedModel can be loaded as a tensorflow predictor. When a JPEG format image is provided as input, the output result of the tensorflow predictor contains information for drawing bounding boxes and classification results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import json\n", "import tensorflow as tf\n", "from PIL import Image\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patches as patches\n", "\n", "# launch predictor and run inference on an arbitrary image in the validation dataset\n", "yolo_pred_cpu = tf.contrib.predictor.from_saved_model('./yolo_v3_coco_saved_model')\n", "image_path = './val2017/000000581781.jpg'\n", "with open(image_path, 'rb') as f:\n", " feeds = {'image': [f.read()]}\n", "results = yolo_pred_cpu(feeds)\n", "\n", "# load annotations to decode classification result\n", "with open('./annotations/instances_val2017.json') as f:\n", " annotate_json = json.load(f)\n", "label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}\n", "\n", "# draw picture and bounding boxes\n", "fig, ax = plt.subplots(figsize=(10, 10))\n", "ax.imshow(Image.open(image_path).convert('RGB'))\n", "wanted = results['scores'][0] > 0.1\n", "for xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):\n", " xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]\n", " rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')\n", " ax.add_patch(rect)\n", " rx, ry = rect.get_xy()\n", " rx = rx + rect.get_width() / 2.0\n", " ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,\n", " ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Compile the Pretrained SavedModel for Neuron\n", "\n", "We make use of the Python compilation API `tfn.saved_model.compile` that is available in `tensorflow-neuron<2`. For the purpose of reducing Neuron runtime overhead, it is necessary to make use of arguments `no_fuse_ops` and `minimum_segment_size`.\n", "Compiled model is saved in ./yolo_v3_coco_saved_model_neuron." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import shutil\n", "import tensorflow as tf\n", "import tensorflow.neuron as tfn\n", "\n", "\n", "def no_fuse_condition(op):\n", " return op.name.startswith('Preprocessor') or op.name.startswith('Postprocessor')\n", "\n", "with tf.Session(graph=tf.Graph()) as sess:\n", " tf.saved_model.loader.load(sess, ['serve'], './yolo_v3_coco_saved_model')\n", " no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]\n", "shutil.rmtree('./yolo_v3_coco_saved_model_neuron', ignore_errors=True)\n", "result = tfn.saved_model.compile(\n", " './yolo_v3_coco_saved_model', './yolo_v3_coco_saved_model_neuron',\n", " # to enforce trivial compilable subgraphs to run on CPU\n", " no_fuse_ops=no_fuse_ops,\n", " minimum_segment_size=100,\n", " batch_size=2,\n", " dynamic_batch_size=True,\n", ")\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy the model on Inferentia\n", "## Part 3:Evaluate Model Quality after Compilation\n", "\n", "### Define evaluation functions\n", "We first define some handy helper functions for running evaluation on the COCO 2017 dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "import json\n", "import time\n", "import numpy as np\n", "import tensorflow as tf\n", "from pycocotools.coco import COCO\n", "from pycocotools.cocoeval import COCOeval\n", "\n", "\n", "def cocoapi_eval(jsonfile,\n", " style,\n", " coco_gt=None,\n", " anno_file=None,\n", " max_dets=(100, 300, 1000)):\n", " \"\"\"\n", " Args:\n", " jsonfile: Evaluation json file, eg: bbox.json, mask.json.\n", " style: COCOeval style, can be `bbox` , `segm` and `proposal`.\n", " coco_gt: Whether to load COCOAPI through anno_file,\n", " eg: coco_gt = COCO(anno_file)\n", " anno_file: COCO annotations file.\n", " max_dets: COCO evaluation maxDets.\n", " \"\"\"\n", " assert coco_gt is not None or anno_file is not None\n", "\n", " if coco_gt is None:\n", " coco_gt = COCO(anno_file)\n", " print(\"Start evaluate...\")\n", " coco_dt = coco_gt.loadRes(jsonfile)\n", " if style == 'proposal':\n", " coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')\n", " coco_eval.params.useCats = 0\n", " coco_eval.params.maxDets = list(max_dets)\n", " else:\n", " coco_eval = COCOeval(coco_gt, coco_dt, style)\n", " coco_eval.evaluate()\n", " coco_eval.accumulate()\n", " coco_eval.summarize()\n", " return coco_eval.stats\n", "\n", "\n", "def bbox_eval(anno_file, bbox_list):\n", " coco_gt = COCO(anno_file)\n", "\n", " outfile = 'bbox_detections.json'\n", " print('Generating json file...')\n", " with open(outfile, 'w') as f:\n", " json.dump(bbox_list, f)\n", "\n", " map_stats = cocoapi_eval(outfile, 'bbox', coco_gt=coco_gt)\n", " return map_stats\n", "\n", "\n", "def get_image_as_bytes(images, eval_pre_path):\n", " batch_im_id_list = []\n", " batch_im_name_list = []\n", " batch_img_bytes_list = []\n", " n = len(images)\n", " batch_im_id = []\n", " batch_im_name = []\n", " batch_img_bytes = []\n", " for i, im in enumerate(images):\n", " im_id = im['id']\n", " file_name = im['file_name']\n", " if i % eval_batch_size == 0 and i != 0:\n", " batch_im_id_list.append(batch_im_id)\n", " batch_im_name_list.append(batch_im_name)\n", " batch_img_bytes_list.append(batch_img_bytes)\n", " batch_im_id = []\n", " batch_im_name = []\n", " batch_img_bytes = []\n", " batch_im_id.append(im_id)\n", " batch_im_name.append(file_name)\n", "\n", " with open(os.path.join(eval_pre_path, file_name), 'rb') as f:\n", " batch_img_bytes.append(f.read())\n", " return batch_im_id_list, batch_im_name_list, batch_img_bytes_list\n", "\n", "\n", "def analyze_bbox(results, batch_im_id, _clsid2catid):\n", " bbox_list = []\n", " k = 0\n", " for boxes, scores, classes in zip(results['boxes'], results['scores'], results['classes']):\n", " if boxes is not None:\n", " im_id = batch_im_id[k]\n", " n = len(boxes)\n", " for p in range(n):\n", " clsid = classes[p]\n", " score = scores[p]\n", " xmin, ymin, xmax, ymax = boxes[p]\n", " catid = (_clsid2catid[int(clsid)])\n", " w = xmax - xmin + 1\n", " h = ymax - ymin + 1\n", "\n", " bbox = [xmin, ymin, w, h]\n", " # Round to the nearest 10th to avoid huge file sizes, as COCO suggests\n", " bbox = [round(float(x) * 10) / 10 for x in bbox]\n", " bbox_res = {\n", " 'image_id': im_id,\n", " 'category_id': catid,\n", " 'bbox': bbox,\n", " 'score': float(score),\n", " }\n", " bbox_list.append(bbox_res)\n", " k += 1\n", " return bbox_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the actual evaluation loop. 
To fully utilize all four cores on one Inferentia, the optimal setup is to run multi-threaded inference using a `ThreadPoolExecutor`. The following cell is a multi-threaded adaptation of the evaluation routine at https://github.com/miemie2013/Keras-YOLOv4/blob/910c4c6f7265f5828fceed0f784496a0b46516bf/tools/cocotools.py#L97." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from concurrent import futures\n", "\n", "def evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):\n", " batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)\n", "\n", " # warm up\n", " yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})\n", "\n", " with futures.ThreadPoolExecutor(4) as exe:\n", " fut_im_list = []\n", " fut_list = []\n", " start_time = time.time()\n", " for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):\n", " if len(batch_img_bytes) != eval_batch_size:\n", " continue\n", " fut = exe.submit(yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})\n", " fut_im_list.append((batch_im_id, batch_im_name))\n", " fut_list.append(fut)\n", " bbox_list = []\n", " count = 0\n", " for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):\n", " results = fut.result()\n", " bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))\n", " for _ in batch_im_id:\n", " count += 1\n", " if count % 100 == 0:\n", " print('Test iter {}'.format(count))\n", " print('==================== Performance Measurement ====================')\n", " print('Finished inference on {} images in {} seconds'.format(len(images), time.time() - start_time))\n", " print('=================================================================')\n", " # start evaluation\n", " box_ap_stats = bbox_eval(anno_file, bbox_list)\n", " return box_ap_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate mean average precision (mAP) score\n", "Here is the code to calculate mAP scores of the YOLO v3 model. The expected mAP score is around 0.328 if we use the pretrained weights." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "yolo_pred = tf.contrib.predictor.from_saved_model('./yolo_v3_coco_saved_model_neuron')\n", "\n", "val_coco_root = './val2017'\n", "val_annotate = './annotations/instances_val2017.json'\n", "clsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,\n", " 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,\n", " 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,\n", " 39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,\n", " 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,\n", " 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,\n", " 75: 86, 76: 87, 77: 88, 78: 89, 79: 90}\n", "eval_batch_size = 8\n", "with open(val_annotate, 'r', encoding='utf-8') as f2:\n", " for line in f2:\n", " line = line.strip()\n", " dataset = json.loads(line)\n", " images = dataset['images']\n", "box_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, clsid2catid)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/tensorflow/yolo_v3_demo/yolo_v3_coco_saved_model.py ================================================ import argparse import os import urllib.request import tempfile import shutil from functools import partial import numpy as np import tensorflow as tf STRIDES = [8, 16, 32] ANCHORS = np.array([1.25,1.625, 2.0,3.75, 4.125,2.875, 1.875,3.8125, 3.875,2.8125, 3.6875,7.4375, 3.625,2.8125, 4.875,6.1875, 11.65625,10.1875]).astype(np.float32).reshape([3, 3, 2]) ANCHOR_PER_SCALE = 3 BOX_SCORE_THRESH = 0.3 UPSAMPLE_METHOD = "resize" NUM_CLASSES = 80 class YOLOV3(object): """Implement tensoflow yolov3 here""" def __init__(self, input_data, input_size, trainable): self.trainable = trainable self.num_class = NUM_CLASSES self.strides = STRIDES self.anchors = ANCHORS self.anchor_per_scale = ANCHOR_PER_SCALE self.box_score_thresh = BOX_SCORE_THRESH self.upsample_method = UPSAMPLE_METHOD input_data, decoded_shape = preprocessor(input_data, [input_size, input_size]) self.conv_lbbox, self.conv_mbbox, self.conv_sbbox = self.__build_nework(input_data) def decode_boxes(bboxes_and_decoded_shape): conv_lbbox, conv_mbbox, conv_sbbox, decoded_shape = bboxes_and_decoded_shape conv_lbbox = tf.cast(conv_lbbox, tf.float32) conv_mbbox = tf.cast(conv_mbbox, tf.float32) conv_sbbox = tf.cast(conv_sbbox, tf.float32) conv_lbbox = conv_lbbox[tf.newaxis, ...] conv_mbbox = conv_mbbox[tf.newaxis, ...] conv_sbbox = conv_sbbox[tf.newaxis, ...] decoded_shape = decoded_shape[tf.newaxis, ...] 
class YOLOV3(object):
    """Implement TensorFlow YOLO v3 here"""

    def __init__(self, input_data, input_size, trainable):
        self.trainable = trainable
        self.num_class = NUM_CLASSES
        self.strides = STRIDES
        self.anchors = ANCHORS
        self.anchor_per_scale = ANCHOR_PER_SCALE
        self.box_score_thresh = BOX_SCORE_THRESH
        self.upsample_method = UPSAMPLE_METHOD
        input_data, decoded_shape = preprocessor(input_data, [input_size, input_size])
        self.conv_lbbox, self.conv_mbbox, self.conv_sbbox = self.__build_nework(input_data)

        def decode_boxes(bboxes_and_decoded_shape):
            conv_lbbox, conv_mbbox, conv_sbbox, decoded_shape = bboxes_and_decoded_shape
            conv_lbbox = tf.cast(conv_lbbox, tf.float32)
            conv_mbbox = tf.cast(conv_mbbox, tf.float32)
            conv_sbbox = tf.cast(conv_sbbox, tf.float32)
            conv_lbbox = conv_lbbox[tf.newaxis, ...]
            conv_mbbox = conv_mbbox[tf.newaxis, ...]
            conv_sbbox = conv_sbbox[tf.newaxis, ...]
            decoded_shape = decoded_shape[tf.newaxis, ...]
            with tf.variable_scope('pred_sbbox'):
                pred_sbbox_coors, pred_sbbox_class_scores = self.decode(conv_sbbox, self.anchors[0], self.strides[0], decoded_shape, input_size)
            with tf.variable_scope('pred_mbbox'):
                pred_mbbox_coors, pred_mbbox_class_scores = self.decode(conv_mbbox, self.anchors[1], self.strides[1], decoded_shape, input_size)
            with tf.variable_scope('pred_lbbox'):
                pred_lbbox_coors, pred_lbbox_class_scores = self.decode(conv_lbbox, self.anchors[2], self.strides[2], decoded_shape, input_size)
            with tf.variable_scope('pred_bbox_filter'):
                pred_bbox_coors = tf.concat([pred_sbbox_coors, pred_mbbox_coors, pred_lbbox_coors], axis=1)
                pred_bbox_class_scores = tf.concat([pred_sbbox_class_scores, pred_mbbox_class_scores, pred_lbbox_class_scores], axis=1)
                nms_top_k = 100
                nms_thresh = 0.45
                coors, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
                    pred_bbox_coors,
                    pred_bbox_class_scores,
                    max_output_size_per_class=nms_top_k,
                    max_total_size=nms_top_k,
                    iou_threshold=nms_thresh,
                    score_threshold=self.box_score_thresh,
                    pad_per_class=False,
                    clip_boxes=False,
                    name='CombinedNonMaxSuppression',
                )
                scores = scores[..., tf.newaxis]
                classes = classes[..., tf.newaxis]
            return coors[0], scores[0], classes[0]

        with tf.name_scope('Postprocessor'):
            coors, scores, classes = tf.map_fn(
                decode_boxes, [self.conv_lbbox, self.conv_mbbox, self.conv_sbbox, decoded_shape],
                dtype=(tf.float32, tf.float32, tf.float32), back_prop=False, parallel_iterations=16)
        with tf.variable_scope('pred_bbox'):
            self.pred_bbox_boxes = tf.identity(coors, name='boxes')
            self.pred_bbox_scores = tf.identity(scores[..., 0], name='scores')
            self.pred_bbox_classes = tf.identity(classes[..., 0], name='classes')

    def __build_nework(self, input_data):
        route_1, route_2, input_data = darknet53(input_data, self.trainable)
        input_data = convolutional(input_data, (1, 1, 1024, 512), self.trainable, 'conv52')
        input_data = convolutional(input_data, (3, 3, 512, 1024), self.trainable, 'conv53')
        input_data = convolutional(input_data, (1, 1, 1024, 512), self.trainable, 'conv54')
        input_data = convolutional(input_data, (3, 3, 512, 1024), self.trainable, 'conv55')
        input_data = convolutional(input_data, (1, 1, 1024, 512), self.trainable, 'conv56')
        conv_lobj_branch = convolutional(input_data, (3, 3, 512, 1024), self.trainable, name='conv_lobj_branch')
        conv_lbbox = convolutional(conv_lobj_branch, (1, 1, 1024, 3*(self.num_class + 5)),
                                   trainable=self.trainable, name='conv_lbbox', activate=False, bn=False)
        input_data = convolutional(input_data, (1, 1, 512, 256), self.trainable, 'conv57')
        input_data = upsample(input_data, name='upsample0', method=self.upsample_method)
        with tf.variable_scope('route_1'):
            input_data = tf.concat([input_data, route_2], axis=-1)
        input_data = convolutional(input_data, (1, 1, 768, 256), self.trainable, 'conv58')
        input_data = convolutional(input_data, (3, 3, 256, 512), self.trainable, 'conv59')
        input_data = convolutional(input_data, (1, 1, 512, 256), self.trainable, 'conv60')
        input_data = convolutional(input_data, (3, 3, 256, 512), self.trainable, 'conv61')
        input_data = convolutional(input_data, (1, 1, 512, 256), self.trainable, 'conv62')
        conv_mobj_branch = convolutional(input_data, (3, 3, 256, 512), self.trainable, name='conv_mobj_branch')
        conv_mbbox = convolutional(conv_mobj_branch, (1, 1, 512, 3*(self.num_class + 5)),
                                   trainable=self.trainable, name='conv_mbbox', activate=False, bn=False)
        input_data = convolutional(input_data, (1, 1, 256, 128), self.trainable, 'conv63')
        input_data = upsample(input_data, name='upsample1', method=self.upsample_method)
        with tf.variable_scope('route_2'):
            input_data = tf.concat([input_data, route_1], axis=-1)
        input_data = convolutional(input_data, (1, 1, 384, 128), self.trainable, 'conv64')
        input_data = convolutional(input_data, (3, 3, 128, 256), self.trainable, 'conv65')
        input_data = convolutional(input_data, (1, 1, 256, 128), self.trainable, 'conv66')
        input_data = convolutional(input_data, (3, 3, 128, 256), self.trainable, 'conv67')
        input_data = convolutional(input_data, (1, 1, 256, 128), self.trainable, 'conv68')
        conv_sobj_branch = convolutional(input_data, (3, 3, 128, 256), self.trainable, name='conv_sobj_branch')
        conv_sbbox = convolutional(conv_sobj_branch, (1, 1, 256, 3*(self.num_class + 5)),
                                   trainable=self.trainable, name='conv_sbbox', activate=False, bn=False)
        return conv_lbbox, conv_mbbox, conv_sbbox

    def decode(self, conv_output, anchors, stride, decoded_shape, input_size):
        """
        return tensor of shape [batch_size, output_size, output_size, anchor_per_scale, 5 + num_classes]
        contains (x, y, w, h, score, probability)
        """
        conv_output = tf.cast(conv_output, tf.float32)
        conv_shape = tf.shape(conv_output)
        batch_size = conv_shape[0]
        output_size = conv_shape[1]
        anchor_per_scale = len(anchors)
        conv_output = tf.reshape(conv_output, (batch_size, output_size, output_size, anchor_per_scale, 5 + self.num_class))
        conv_raw_dxdy = conv_output[:, :, :, :, 0:2]
        conv_raw_dwdh = conv_output[:, :, :, :, 2:4]
        conv_raw_conf = conv_output[:, :, :, :, 4:5]
        conv_raw_prob = conv_output[:, :, :, :, 5:]
        y = tf.tile(tf.range(output_size, dtype=tf.int32)[:, tf.newaxis], [1, output_size])
        x = tf.tile(tf.range(output_size, dtype=tf.int32)[tf.newaxis, :], [output_size, 1])
        xy_grid = tf.concat([x[:, :, tf.newaxis], y[:, :, tf.newaxis]], axis=-1)
        xy_grid = tf.tile(xy_grid[tf.newaxis, :, :, tf.newaxis, :], [batch_size, 1, 1, anchor_per_scale, 1])
        xy_grid = tf.cast(xy_grid, tf.float32)
        pred_xy = (tf.sigmoid(conv_raw_dxdy) + xy_grid) * stride
        pred_wh = (tf.exp(conv_raw_dwdh) * anchors) * stride
        pred_xywh = tf.concat([pred_xy, pred_wh], axis=-1)
        pred_conf = tf.sigmoid(conv_raw_conf)
        pred_prob = tf.sigmoid(conv_raw_prob)
        pred_xywh = tf.reshape(pred_xywh, (-1, output_size*output_size*3, pred_xywh.shape[-1]))
        pred_conf = tf.reshape(pred_conf, (-1, output_size*output_size*3))
        pred_prob = tf.reshape(pred_prob, (-1, output_size*output_size*3, pred_prob.shape[-1]))
        return tf_postprocess_boxes(pred_xywh, pred_conf, pred_prob, decoded_shape, input_size, self.box_score_thresh)


def darknet53(input_data, trainable):
    with tf.variable_scope('darknet'):
        input_data = convolutional(input_data, filters_shape=(3, 3, 3, 32), trainable=trainable, name='conv0')
        input_data = convolutional(input_data, filters_shape=(3, 3, 32, 64), trainable=trainable, name='conv1', downsample=True)
        for i in range(1):
            input_data = residual_block(input_data, 64, 32, 64, trainable=trainable, name='residual%d' % (i + 0))
        input_data = convolutional(input_data, filters_shape=(3, 3, 64, 128), trainable=trainable, name='conv4', downsample=True)
        for i in range(2):
            input_data = residual_block(input_data, 128, 64, 128, trainable=trainable, name='residual%d' % (i + 1))
        input_data = convolutional(input_data, filters_shape=(3, 3, 128, 256), trainable=trainable, name='conv9', downsample=True)
        for i in range(8):
            input_data = residual_block(input_data, 256, 128, 256, trainable=trainable, name='residual%d' % (i + 3))
        route_1 = input_data
        input_data = convolutional(input_data, filters_shape=(3, 3, 256, 512), trainable=trainable, name='conv26', downsample=True)
        for i in range(8):
            input_data = residual_block(input_data, 512, 256, 512, trainable=trainable, name='residual%d' % (i + 11))
        route_2 = input_data
        input_data = convolutional(input_data, filters_shape=(3, 3, 512, 1024), trainable=trainable, name='conv43', downsample=True)
        for i in range(4):
            input_data = residual_block(input_data, 1024, 512, 1024, trainable=trainable, name='residual%d' % (i + 19))
        return route_1, route_2, input_data
def convolutional(input_data, filters_shape, trainable, name, downsample=False, activate=True, bn=True):
    with tf.variable_scope(name):
        if downsample:
            pad_h, pad_w = (filters_shape[0] - 2) // 2 + 1, (filters_shape[1] - 2) // 2 + 1
            paddings = tf.constant([[0, 0], [pad_h, pad_h], [pad_w, pad_w], [0, 0]])
            input_data = tf.pad(input_data, paddings, 'CONSTANT')
            strides = (1, 2, 2, 1)
            padding = 'VALID'
        else:
            strides = (1, 1, 1, 1)
            padding = "SAME"
        weight = tf.get_variable(name='weight', dtype=tf.float32, trainable=True,
                                 shape=filters_shape, initializer=tf.random_normal_initializer(stddev=0.01))
        weight = tf.cast(weight, tf.float16)
        conv = tf.nn.conv2d(input=input_data, filter=weight, strides=strides, padding=padding)
        if bn:
            conv = tf.layers.batch_normalization(conv, beta_initializer=tf.zeros_initializer(),
                                                 gamma_initializer=tf.ones_initializer(),
                                                 moving_mean_initializer=tf.zeros_initializer(),
                                                 moving_variance_initializer=tf.ones_initializer(),
                                                 training=trainable, fused=False)
        else:
            bias = tf.get_variable(name='bias', shape=filters_shape[-1], trainable=True,
                                   dtype=tf.float32, initializer=tf.constant_initializer(0.0))
            bias = tf.cast(bias, tf.float16)
            conv = tf.nn.bias_add(conv, bias)
        if activate:
            conv = tf.nn.leaky_relu(conv, alpha=0.1)
    return conv


def residual_block(input_data, input_channel, filter_num1, filter_num2, trainable, name):
    short_cut = input_data
    with tf.variable_scope(name):
        input_data = convolutional(input_data, filters_shape=(1, 1, input_channel, filter_num1),
                                   trainable=trainable, name='conv1')
        input_data = convolutional(input_data, filters_shape=(3, 3, filter_num1, filter_num2),
                                   trainable=trainable, name='conv2')
        residual_output = input_data + short_cut
    return residual_output


def upsample(input_data, name, method="deconv"):
    assert method in ["resize", "deconv"]
    if method == "resize":
        with tf.variable_scope(name):
            input_shape = tf.shape(input_data)
            output = tf.image.resize_nearest_neighbor(input_data, (input_shape[1] * 2, input_shape[2] * 2))
    if method == "deconv":
        # replace resize_nearest_neighbor with conv2d_transpose to support TensorRT optimization
        num_filter = input_data.shape.as_list()[-1]
        output = tf.layers.conv2d_transpose(input_data, num_filter, kernel_size=2, padding='same',
                                            strides=(2, 2), kernel_initializer=tf.random_normal_initializer())
    return output


def decode_jpeg_resize(input_tensor, image_size):
    tensor = tf.image.decode_png(input_tensor, channels=3)
    shape = tf.shape(tensor)
    tensor = tf.cast(tensor, tf.float32)
    tensor = tf.image.resize_image_with_pad(tensor, image_size[0], image_size[1])
    tensor /= 255.0
    return tf.cast(tensor, tf.float16), shape


def preprocessor(input_tensor, image_size):
    with tf.name_scope('Preprocessor'):
        batch_tensor, batch_shape = tf.map_fn(
            partial(decode_jpeg_resize, image_size=image_size), input_tensor,
            dtype=(tf.float16, tf.int32), back_prop=False, parallel_iterations=16)
    return batch_tensor, batch_shape


def tf_postprocess_boxes(pred_xywh, pred_conf, pred_prob, org_img_shape, input_size, score_threshold):
    batch_size = tf.shape(pred_xywh)[0]

    # (1) (x, y, w, h) --> (xmin, ymin, xmax, ymax)
    pred_coor = tf.concat([pred_xywh[:, :, :2] - pred_xywh[:, :, 2:] * 0.5,
                           pred_xywh[:, :, :2] + pred_xywh[:, :, 2:] * 0.5], axis=-1)

    # (2) (xmin, ymin, xmax, ymax) -> (xmin_org, ymin_org, xmax_org, ymax_org)
    org_wh = org_img_shape[:, tf.newaxis, 1::-1]
    org_whwh = tf.concat([org_wh, org_wh], axis=-1)
    org_whwh = tf.cast(org_whwh, tf.float32)
    input_size = np.float32(input_size)
    resize_ratio = input_size / tf.reduce_max(org_whwh, axis=-1)
    dwhwh = (input_size - resize_ratio * org_whwh) / 2
    pred_coor = (pred_coor - dwhwh) / resize_ratio

    # (5) discard some boxes with low scores
    scores = pred_conf * tf.reduce_max(pred_prob, axis=-1)
    score_mask = scores > score_threshold
    coors = pred_coor[score_mask]
    pred_conf = pred_conf[score_mask]
    pred_conf = tf.reshape(pred_conf, [batch_size, -1, 1])
    pred_prob = pred_prob[score_mask]
    pred_prob = tf.reshape(pred_prob, [batch_size, -1, pred_prob.shape[-1]])
    class_scores = pred_conf * pred_prob
    coors = tf.reshape(coors, [batch_size, -1, 1, coors.shape[-1]])
    class_scores = tf.reshape(class_scores, [batch_size, -1, class_scores.shape[-1]])
    return coors, class_scores
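# Worked example (added; illustrative only): tf_postprocess_boxes inverts the
# letterbox resize performed in decode_jpeg_resize. For a 640x480 image with
# input_size = 416, resize_ratio = 416 / 640 = 0.65; the scaled height is
# 0.65 * 480 = 312, so dwhwh pads (416 - 312) / 2 = 52 pixels on top and
# bottom (0 on the sides). Subtracting dwhwh and dividing by resize_ratio maps
# the predicted boxes back to original-image pixel coordinates.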
def convert_weights(org_weights_path, cur_weights_path, input_size):
    org_weights_mess = []
    with tf.Session(graph=tf.Graph()) as sess:
        load = tf.train.import_meta_graph(org_weights_path + '.meta')
        load.restore(sess, org_weights_path)
        for var in tf.global_variables():
            var_name = var.op.name
            var_name_mess = str(var_name).split('/')
            var_shape = var.shape
            org_weights_mess.append([var_name, var_shape])
            print("=> " + str(var_name).ljust(50), var_shape)
    print()
    cur_weights_mess = []
    with tf.Session(graph=tf.Graph()) as sess:
        with tf.name_scope('input'):
            input_data = tf.placeholder(dtype=tf.string, shape=(None,), name='input_data')
            training = tf.placeholder(dtype=tf.bool, name='trainable')
        model = YOLOV3(input_data, input_size, training)
        for var in tf.global_variables():
            var_name = var.op.name
            var_name_mess = str(var_name).split('/')
            var_shape = var.shape
            print(var_name_mess[0])
            cur_weights_mess.append([var_name, var_shape])
            print("=> " + str(var_name).ljust(50), var_shape)
        org_weights_num = len(org_weights_mess)
        cur_weights_num = len(cur_weights_mess)
        if cur_weights_num != org_weights_num:
            raise RuntimeError
        print('=> Number of weights that will be renamed:\t%d' % cur_weights_num)
        cur_to_org_dict = {}
        for index in range(org_weights_num):
            org_name, org_shape = org_weights_mess[index]
            cur_name, cur_shape = cur_weights_mess[index]
            if cur_shape != org_shape:
                print(org_weights_mess[index])
                print(cur_weights_mess[index])
                raise RuntimeError
            cur_to_org_dict[cur_name] = org_name
            print("=> " + str(cur_name).ljust(50) + ' : ' + org_name)
        with tf.name_scope('load_save'):
            name_to_var_dict = {var.op.name: var for var in tf.global_variables()}
            restore_dict = {cur_to_org_dict[cur_name]: name_to_var_dict[cur_name] for cur_name in cur_to_org_dict}
            load = tf.train.Saver(restore_dict)
            save = tf.train.Saver(tf.global_variables())
            for var in tf.global_variables():
                print("=> " + var.op.name)
        sess.run(tf.global_variables_initializer())
        print('=> Restoring weights from:\t %s' % org_weights_path)
        load.restore(sess, org_weights_path)
        save.save(sess, cur_weights_path)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_dir')
    args = parser.parse_args()
    if os.path.exists(args.model_dir):
        raise OSError('Directory {} already exists; please specify a different path for the tensorflow SavedModel'.format(args.model_dir))
    with tempfile.TemporaryDirectory() as workdir:
        ckpt_file = os.path.join(workdir, './yolov3_coco_demo.ckpt')
        input_size = 416
        if not os.path.isfile(ckpt_file + '.meta'):
            yolov3_coco_tar_gz = os.path.join(workdir, './yolov3_coco.tar.gz')
            url = 'https://github.com/YunYang1994/tensorflow-yolov3/releases/download/v1.0/yolov3_coco.tar.gz'
            print('Downloading from {}'.format(url))
            urllib.request.urlretrieve(url, yolov3_coco_tar_gz)
            shutil.unpack_archive(yolov3_coco_tar_gz, extract_dir=workdir)
            convert_weights(os.path.join(workdir, './yolov3_coco.ckpt'), ckpt_file, input_size)
        input_tensor_name = 'input/input_data:0'
        output_names = ['boxes', 'scores', 'classes']
        output_tensor_names = ['pred_bbox/boxes:0', 'pred_bbox/scores:0', 'pred_bbox/classes:0']
        with tf.Session(graph=tf.Graph()) as sess:
            with tf.name_scope('input'):
                input_data = tf.placeholder(dtype=tf.string, shape=[None], name='input_data')
            model = YOLOV3(input_data, input_size, trainable=False)
            print(model.conv_sbbox, model.conv_mbbox, model.conv_lbbox)
            saver = tf.train.Saver()
            saver.restore(sess, ckpt_file)
            input_tensor = sess.graph.get_tensor_by_name(input_tensor_name)
            inputs = {'image': input_tensor}
            outputs = {name: sess.graph.get_tensor_by_name(tensor_name)
                       for name, tensor_name in zip(output_names, output_tensor_names)}
            tf.saved_model.simple_save(sess, args.model_dir, inputs, outputs)
    print('tensorflow YOLO v3 SavedModel generated at {}'.format(args.model_dir))


if __name__ == '__main__':
    main()
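# Usage sketch (added; paths are illustrative assumptions that mirror the
# accompanying evaluate.ipynb): after generating the SavedModel with
# `python3 yolo_v3_coco_saved_model.py ./yolo_v3_coco_saved_model`, it could be
# compiled for Inferentia with tensorflow-neuron 1.x roughly as follows:
#
#     import tensorflow.neuron as tfn
#     result = tfn.saved_model.compile(
#         './yolo_v3_coco_saved_model',         # SavedModel produced by this script
#         './yolo_v3_coco_saved_model_neuron',  # directory loaded by evaluate.ipynb
#         batch_size=1,
#         dynamic_batch_size=True,
#     )
#     print(result)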
================================================
FILE: src/examples/tensorflow/yolo_v4_demo/README.md
================================================

Please view our documentation at **[https://awsdocs-neuron.readthedocs-hosted.com/](https://awsdocs-neuron.readthedocs-hosted.com/)**

================================================
FILE: src/examples/tensorflow/yolo_v4_demo/evaluate.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate YOLO v4 on Inferentia\n", "## Note: this tutorial runs on tensorflow-neuron 1.x only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This tutorial walks through compiling and evaluating the YOLO v4 model on Inferentia using the AWS Neuron SDK 09/2020 release. We recommend running this tutorial on an EC2 `inf1.2xlarge` instance, which contains one Inferentia and 8 vCPU cores, as well as 16 GB of memory. Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [Tensorflow Installation Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuron/setup/tensorflow-install.html#install-neuron-tensorflow). You can select the kernel from the “Kernel -> Change Kernel” option at the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This demo requires the following pip packages:\n", "\n", "`neuron-cc tensorflow-neuron<2 requests pillow matplotlib pycocotools torch`\n", "\n", "and the debian/rpm package `aws-neuron-runtime`.\n", "\n", "On DLAMI, `aws-neuron-runtime` is already pre-installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install tensorflow_neuron==1.15.5.2.8.9.0 neuron_cc==1.13.5.0 requests pillow matplotlib pycocotools==2.0.1 numpy==1.18.2 torch~=1.5.0 --force \\\n", "  --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Download Dataset and Generate Pretrained SavedModel\n", "### Download COCO 2017 validation dataset\n", "We start by downloading the COCO validation dataset, which we will use to validate our model. The COCO 2017 dataset is widely used for object detection, segmentation, and image captioning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -LO http://images.cocodataset.org/zips/val2017.zip\n", "!curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip\n", "!unzip -q val2017.zip\n", "!unzip annotations_trainval2017.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check required package versions\n", "Here are the minimum required versions of AWS Neuron packages. We run a check."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pkg_resources\n", "from distutils.version import LooseVersion\n", "\n", "assert LooseVersion(pkg_resources.get_distribution('neuron-cc').version) > LooseVersion('1.0.20000')\n", "assert LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version) > LooseVersion('1.15.3.1.0.2000')\n", "print('passed package version checks')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate YOLO v4 tensorflow SavedModel (pretrained on COCO 2017 dataset)\n", "Script `yolo_v4_coco_saved_model.py` will generate a tensorflow SavedModel using pretrained weights from https://github.com/Tianxiaomo/pytorch-YOLOv4." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python3 yolo_v4_coco_saved_model.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tensorflow SavedModel can be loaded as a tensorflow predictor. When a JPEG format image is provided as input, the output result of the tensorflow predictor contains information for drawing bounding boxes and classification results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import tensorflow as tf\n", "from PIL import Image\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patches as patches\n", "\n", "# launch predictor and run inference on an arbitrary image in the validation dataset\n", "yolo_pred_cpu = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model')\n", "image_path = './val2017/000000581781.jpg'\n", "with open(image_path, 'rb') as f:\n", "    feeds = {'image': [f.read()]}\n", "results = yolo_pred_cpu(feeds)\n", "\n", "# load annotations to decode classification result\n", "with open('./annotations/instances_val2017.json') as f:\n", "    annotate_json = json.load(f)\n", "label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}\n", "\n", "# draw picture and bounding boxes\n", "fig, ax = plt.subplots(figsize=(10, 10))\n", "ax.imshow(Image.open(image_path).convert('RGB'))\n", "wanted = results['scores'][0] > 0.1\n", "for xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):\n", "    xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]\n", "    rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')\n", "    ax.add_patch(rect)\n", "    rx, ry = rect.get_xy()\n", "    rx = rx + rect.get_width() / 2.0\n", "    ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,\n", "                ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Compile the Pretrained SavedModel for Inferentia\n", "We make use of the Python compilation API `tfn.saved_model.compile` that is available in `tensorflow-neuron<2`. To reduce Neuron runtime overhead, it is necessary to use the arguments `no_fuse_ops` and `minimum_segment_size`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "import tensorflow as tf\n", "import tensorflow.neuron as tfn\n", "\n", "\n", "def no_fuse_condition(op):\n", " return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1/Cast', 'lambda_2/Cast', 'lambda_3/Cast'])\n", "\n", "with tf.Session(graph=tf.Graph()) as sess:\n", " tf.saved_model.loader.load(sess, ['serve'], './yolo_v4_coco_saved_model')\n", " no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]\n", "shutil.rmtree('./yolo_v4_coco_saved_model_neuron', ignore_errors=True)\n", "result = tfn.saved_model.compile(\n", " './yolo_v4_coco_saved_model', './yolo_v4_coco_saved_model_neuron',\n", " # we partition the graph before casting from float16 to float32, to help reduce the output tensor size by 1/2\n", " no_fuse_ops=no_fuse_ops,\n", " # to enforce trivial compilable subgraphs to run on CPU\n", " minimum_segment_size=100,\n", " batch_size=1,\n", " dynamic_batch_size=True,\n", ")\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Evaluate Model Quality after Compilation\n", "### Define evaluation functions\n", "We first define some handy helper functions for running evaluation on the COCO 2017 dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import time\n", "import numpy as np\n", "import tensorflow as tf\n", "from pycocotools.coco import COCO\n", "from pycocotools.cocoeval import COCOeval\n", "\n", "\n", "def cocoapi_eval(jsonfile,\n", " style,\n", " coco_gt=None,\n", " anno_file=None,\n", " max_dets=(100, 300, 1000)):\n", " \"\"\"\n", " Args:\n", " jsonfile: Evaluation json file, eg: bbox.json, mask.json.\n", " style: COCOeval style, can be `bbox` , `segm` and `proposal`.\n", " coco_gt: Whether to load COCOAPI through anno_file,\n", " eg: coco_gt = COCO(anno_file)\n", " anno_file: COCO annotations file.\n", " max_dets: COCO evaluation maxDets.\n", " \"\"\"\n", " assert coco_gt is not None or anno_file is not None\n", "\n", " if coco_gt is None:\n", " coco_gt = COCO(anno_file)\n", " print(\"Start evaluate...\")\n", " coco_dt = coco_gt.loadRes(jsonfile)\n", " if style == 'proposal':\n", " coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')\n", " coco_eval.params.useCats = 0\n", " coco_eval.params.maxDets = list(max_dets)\n", " else:\n", " coco_eval = COCOeval(coco_gt, coco_dt, style)\n", " coco_eval.evaluate()\n", " coco_eval.accumulate()\n", " coco_eval.summarize()\n", " return coco_eval.stats\n", "\n", "\n", "def bbox_eval(anno_file, bbox_list):\n", " coco_gt = COCO(anno_file)\n", "\n", " outfile = 'bbox_detections.json'\n", " print('Generating json file...')\n", " with open(outfile, 'w') as f:\n", " json.dump(bbox_list, f)\n", "\n", " map_stats = cocoapi_eval(outfile, 'bbox', coco_gt=coco_gt)\n", " return map_stats\n", "\n", "\n", "def get_image_as_bytes(images, eval_pre_path):\n", " batch_im_id_list = []\n", " batch_im_name_list = []\n", " batch_img_bytes_list = []\n", " n = len(images)\n", " batch_im_id = []\n", " batch_im_name = []\n", " batch_img_bytes = []\n", " for i, im in enumerate(images):\n", " im_id = im['id']\n", " file_name = im['file_name']\n", " if i % eval_batch_size == 0 and i != 0:\n", " batch_im_id_list.append(batch_im_id)\n", " batch_im_name_list.append(batch_im_name)\n", " batch_img_bytes_list.append(batch_img_bytes)\n", " batch_im_id = []\n", " batch_im_name = []\n", " batch_img_bytes = 
[]\n", " batch_im_id.append(im_id)\n", " batch_im_name.append(file_name)\n", "\n", " with open(os.path.join(eval_pre_path, file_name), 'rb') as f:\n", " batch_img_bytes.append(f.read())\n", " return batch_im_id_list, batch_im_name_list, batch_img_bytes_list\n", "\n", "\n", "def analyze_bbox(results, batch_im_id, _clsid2catid):\n", " bbox_list = []\n", " k = 0\n", " for boxes, scores, classes in zip(results['boxes'], results['scores'], results['classes']):\n", " if boxes is not None:\n", " im_id = batch_im_id[k]\n", " n = len(boxes)\n", " for p in range(n):\n", " clsid = classes[p]\n", " score = scores[p]\n", " xmin, ymin, xmax, ymax = boxes[p]\n", " catid = (_clsid2catid[int(clsid)])\n", " w = xmax - xmin + 1\n", " h = ymax - ymin + 1\n", "\n", " bbox = [xmin, ymin, w, h]\n", " # Round to the nearest 10th to avoid huge file sizes, as COCO suggests\n", " bbox = [round(float(x) * 10) / 10 for x in bbox]\n", " bbox_res = {\n", " 'image_id': im_id,\n", " 'category_id': catid,\n", " 'bbox': bbox,\n", " 'score': float(score),\n", " }\n", " bbox_list.append(bbox_res)\n", " k += 1\n", " return bbox_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the actual evaluation loop. To fully utilize all four cores on one Inferentia, the optimal setup is to run multi-threaded inference using a `ThreadPoolExecutor`. The following cell is a multi-threaded adaptation of the evaluation routine at https://github.com/miemie2013/Keras-YOLOv4/blob/910c4c6f7265f5828fceed0f784496a0b46516bf/tools/cocotools.py#L97." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from concurrent import futures\n", "\n", "NUM_THREADS = 4\n", "\n", "def evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):\n", " batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)\n", "\n", " # warm up\n", " yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})\n", " \n", " def yolo_predictor_timer(yolo_pred, image):\n", " begin = time.time()\n", " result = yolo_pred(image)\n", " delta = time.time() - begin\n", " return result, delta\n", "\n", " latency = []\n", " with futures.ThreadPoolExecutor(NUM_THREADS) as exe:\n", " fut_im_list = []\n", " fut_list = []\n", "\n", " start_time = time.time()\n", " for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):\n", " if len(batch_img_bytes) != eval_batch_size:\n", " continue\n", " fut = exe.submit(yolo_predictor_timer, yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})\n", " fut_im_list.append((batch_im_id, batch_im_name))\n", " fut_list.append(fut)\n", " bbox_list = []\n", " sum_time = 0.0\n", " count = 0\n", " for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):\n", " results, times = fut.result()\n", " # Adjust latency since we are in batch\n", " latency.append(times / eval_batch_size)\n", " sum_time += times\n", " bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))\n", " for _ in batch_im_id:\n", " count += 1\n", " if count % 1000 == 0:\n", " print('Test iter {}'.format(count))\n", "\n", " throughput = len(images) / (sum_time / NUM_THREADS)\n", "\n", " \n", " print('Average Images Per Second:', throughput)\n", " print(\"Latency P50: {:.1f} ms\".format(np.percentile(latency, 50)*1000.0))\n", " print(\"Latency P90: {:.1f} ms\".format(np.percentile(latency, 90)*1000.0))\n", " print(\"Latency P95: {:.1f} 
ms\".format(np.percentile(latency, 95)*1000.0))\n", " print(\"Latency P99: {:.1f} ms\".format(np.percentile(latency, 99)*1000.0))\n", "\n", " # start evaluation\n", " box_ap_stats = bbox_eval(anno_file, bbox_list)\n", " return box_ap_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate mean average precision (mAP) score\n", "Here is the code to calculate mAP scores of the YOLO v4 model. The expected mAP score is around 0.487 if we use the pretrained weights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "yolo_pred = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron')\n", "\n", "val_coco_root = './val2017'\n", "val_annotate = './annotations/instances_val2017.json'\n", "clsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,\n", " 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,\n", " 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,\n", " 39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,\n", " 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,\n", " 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,\n", " 75: 86, 76: 87, 77: 88, 78: 89, 79: 90}\n", "eval_batch_size = 8\n", "with open(val_annotate, 'r', encoding='utf-8') as f2:\n", " for line in f2:\n", " line = line.strip()\n", " dataset = json.loads(line)\n", " images = dataset['images']\n", "box_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, clsid2catid)" ] } ], "metadata": { "kernelspec": { "display_name": "Environment (conda_aws_neuron_tensorflow_p36)", "language": "python", "name": "conda_aws_neuron_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: src/examples/tensorflow/yolo_v4_demo/yolo_v4_coco_saved_model.py ================================================ import os import io from functools import partial import requests import numpy as np import torch import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers def rename_weights(checkpoint): name_mapping = { 'down1.conv1.conv.0.weight': 'models.0.conv1.weight', 'down1.conv1.conv.1.weight': 'models.0.bn1.weight', 'down1.conv1.conv.1.bias': 'models.0.bn1.bias', 'down1.conv1.conv.1.running_mean': 'models.0.bn1.running_mean', 'down1.conv1.conv.1.running_var': 'models.0.bn1.running_var', 'down1.conv1.conv.1.num_batches_tracked': 'models.0.bn1.num_batches_tracked', 'down1.conv2.conv.0.weight': 'models.1.conv2.weight', 'down1.conv2.conv.1.weight': 'models.1.bn2.weight', 'down1.conv2.conv.1.bias': 'models.1.bn2.bias', 'down1.conv2.conv.1.running_mean': 'models.1.bn2.running_mean', 'down1.conv2.conv.1.running_var': 'models.1.bn2.running_var', 'down1.conv2.conv.1.num_batches_tracked': 'models.1.bn2.num_batches_tracked', 'down1.conv3.conv.0.weight': 'models.2.conv3.weight', 'down1.conv3.conv.1.weight': 'models.2.bn3.weight', 'down1.conv3.conv.1.bias': 'models.2.bn3.bias', 'down1.conv3.conv.1.running_mean': 'models.2.bn3.running_mean', 
'down1.conv3.conv.1.running_var': 'models.2.bn3.running_var', 'down1.conv3.conv.1.num_batches_tracked': 'models.2.bn3.num_batches_tracked', 'down1.conv4.conv.0.weight': 'models.4.conv4.weight', 'down1.conv4.conv.1.weight': 'models.4.bn4.weight', 'down1.conv4.conv.1.bias': 'models.4.bn4.bias', 'down1.conv4.conv.1.running_mean': 'models.4.bn4.running_mean', 'down1.conv4.conv.1.running_var': 'models.4.bn4.running_var', 'down1.conv4.conv.1.num_batches_tracked': 'models.4.bn4.num_batches_tracked', 'down1.conv5.conv.0.weight': 'models.5.conv5.weight', 'down1.conv5.conv.1.weight': 'models.5.bn5.weight', 'down1.conv5.conv.1.bias': 'models.5.bn5.bias', 'down1.conv5.conv.1.running_mean': 'models.5.bn5.running_mean', 'down1.conv5.conv.1.running_var': 'models.5.bn5.running_var', 'down1.conv5.conv.1.num_batches_tracked': 'models.5.bn5.num_batches_tracked', 'down1.conv6.conv.0.weight': 'models.6.conv6.weight', 'down1.conv6.conv.1.weight': 'models.6.bn6.weight', 'down1.conv6.conv.1.bias': 'models.6.bn6.bias', 'down1.conv6.conv.1.running_mean': 'models.6.bn6.running_mean', 'down1.conv6.conv.1.running_var': 'models.6.bn6.running_var', 'down1.conv6.conv.1.num_batches_tracked': 'models.6.bn6.num_batches_tracked', 'down1.conv7.conv.0.weight': 'models.8.conv7.weight', 'down1.conv7.conv.1.weight': 'models.8.bn7.weight', 'down1.conv7.conv.1.bias': 'models.8.bn7.bias', 'down1.conv7.conv.1.running_mean': 'models.8.bn7.running_mean', 'down1.conv7.conv.1.running_var': 'models.8.bn7.running_var', 'down1.conv7.conv.1.num_batches_tracked': 'models.8.bn7.num_batches_tracked', 'down1.conv8.conv.0.weight': 'models.10.conv8.weight', 'down1.conv8.conv.1.weight': 'models.10.bn8.weight', 'down1.conv8.conv.1.bias': 'models.10.bn8.bias', 'down1.conv8.conv.1.running_mean': 'models.10.bn8.running_mean', 'down1.conv8.conv.1.running_var': 'models.10.bn8.running_var', 'down1.conv8.conv.1.num_batches_tracked': 'models.10.bn8.num_batches_tracked', 'down2.conv1.conv.0.weight': 'models.11.conv9.weight', 'down2.conv1.conv.1.weight': 'models.11.bn9.weight', 'down2.conv1.conv.1.bias': 'models.11.bn9.bias', 'down2.conv1.conv.1.running_mean': 'models.11.bn9.running_mean', 'down2.conv1.conv.1.running_var': 'models.11.bn9.running_var', 'down2.conv1.conv.1.num_batches_tracked': 'models.11.bn9.num_batches_tracked', 'down2.conv2.conv.0.weight': 'models.12.conv10.weight', 'down2.conv2.conv.1.weight': 'models.12.bn10.weight', 'down2.conv2.conv.1.bias': 'models.12.bn10.bias', 'down2.conv2.conv.1.running_mean': 'models.12.bn10.running_mean', 'down2.conv2.conv.1.running_var': 'models.12.bn10.running_var', 'down2.conv2.conv.1.num_batches_tracked': 'models.12.bn10.num_batches_tracked', 'down2.conv3.conv.0.weight': 'models.14.conv11.weight', 'down2.conv3.conv.1.weight': 'models.14.bn11.weight', 'down2.conv3.conv.1.bias': 'models.14.bn11.bias', 'down2.conv3.conv.1.running_mean': 'models.14.bn11.running_mean', 'down2.conv3.conv.1.running_var': 'models.14.bn11.running_var', 'down2.conv3.conv.1.num_batches_tracked': 'models.14.bn11.num_batches_tracked', 'down2.resblock.module_list.0.0.conv.0.weight': 'models.15.conv12.weight', 'down2.resblock.module_list.0.0.conv.1.weight': 'models.15.bn12.weight', 'down2.resblock.module_list.0.0.conv.1.bias': 'models.15.bn12.bias', 'down2.resblock.module_list.0.0.conv.1.running_mean': 'models.15.bn12.running_mean', 'down2.resblock.module_list.0.0.conv.1.running_var': 'models.15.bn12.running_var', 'down2.resblock.module_list.0.0.conv.1.num_batches_tracked': 'models.15.bn12.num_batches_tracked', 
'down2.resblock.module_list.0.1.conv.0.weight': 'models.16.conv13.weight', 'down2.resblock.module_list.0.1.conv.1.weight': 'models.16.bn13.weight', 'down2.resblock.module_list.0.1.conv.1.bias': 'models.16.bn13.bias', 'down2.resblock.module_list.0.1.conv.1.running_mean': 'models.16.bn13.running_mean', 'down2.resblock.module_list.0.1.conv.1.running_var': 'models.16.bn13.running_var', 'down2.resblock.module_list.0.1.conv.1.num_batches_tracked': 'models.16.bn13.num_batches_tracked', 'down2.resblock.module_list.1.0.conv.0.weight': 'models.18.conv14.weight', 'down2.resblock.module_list.1.0.conv.1.weight': 'models.18.bn14.weight', 'down2.resblock.module_list.1.0.conv.1.bias': 'models.18.bn14.bias', 'down2.resblock.module_list.1.0.conv.1.running_mean': 'models.18.bn14.running_mean', 'down2.resblock.module_list.1.0.conv.1.running_var': 'models.18.bn14.running_var', 'down2.resblock.module_list.1.0.conv.1.num_batches_tracked': 'models.18.bn14.num_batches_tracked', 'down2.resblock.module_list.1.1.conv.0.weight': 'models.19.conv15.weight', 'down2.resblock.module_list.1.1.conv.1.weight': 'models.19.bn15.weight', 'down2.resblock.module_list.1.1.conv.1.bias': 'models.19.bn15.bias', 'down2.resblock.module_list.1.1.conv.1.running_mean': 'models.19.bn15.running_mean', 'down2.resblock.module_list.1.1.conv.1.running_var': 'models.19.bn15.running_var', 'down2.resblock.module_list.1.1.conv.1.num_batches_tracked': 'models.19.bn15.num_batches_tracked', 'down2.conv4.conv.0.weight': 'models.21.conv16.weight', 'down2.conv4.conv.1.weight': 'models.21.bn16.weight', 'down2.conv4.conv.1.bias': 'models.21.bn16.bias', 'down2.conv4.conv.1.running_mean': 'models.21.bn16.running_mean', 'down2.conv4.conv.1.running_var': 'models.21.bn16.running_var', 'down2.conv4.conv.1.num_batches_tracked': 'models.21.bn16.num_batches_tracked', 'down2.conv5.conv.0.weight': 'models.23.conv17.weight', 'down2.conv5.conv.1.weight': 'models.23.bn17.weight', 'down2.conv5.conv.1.bias': 'models.23.bn17.bias', 'down2.conv5.conv.1.running_mean': 'models.23.bn17.running_mean', 'down2.conv5.conv.1.running_var': 'models.23.bn17.running_var', 'down2.conv5.conv.1.num_batches_tracked': 'models.23.bn17.num_batches_tracked', 'down3.conv1.conv.0.weight': 'models.24.conv18.weight', 'down3.conv1.conv.1.weight': 'models.24.bn18.weight', 'down3.conv1.conv.1.bias': 'models.24.bn18.bias', 'down3.conv1.conv.1.running_mean': 'models.24.bn18.running_mean', 'down3.conv1.conv.1.running_var': 'models.24.bn18.running_var', 'down3.conv1.conv.1.num_batches_tracked': 'models.24.bn18.num_batches_tracked', 'down3.conv2.conv.0.weight': 'models.25.conv19.weight', 'down3.conv2.conv.1.weight': 'models.25.bn19.weight', 'down3.conv2.conv.1.bias': 'models.25.bn19.bias', 'down3.conv2.conv.1.running_mean': 'models.25.bn19.running_mean', 'down3.conv2.conv.1.running_var': 'models.25.bn19.running_var', 'down3.conv2.conv.1.num_batches_tracked': 'models.25.bn19.num_batches_tracked', 'down3.conv3.conv.0.weight': 'models.27.conv20.weight', 'down3.conv3.conv.1.weight': 'models.27.bn20.weight', 'down3.conv3.conv.1.bias': 'models.27.bn20.bias', 'down3.conv3.conv.1.running_mean': 'models.27.bn20.running_mean', 'down3.conv3.conv.1.running_var': 'models.27.bn20.running_var', 'down3.conv3.conv.1.num_batches_tracked': 'models.27.bn20.num_batches_tracked', 'down3.resblock.module_list.0.0.conv.0.weight': 'models.28.conv21.weight', 'down3.resblock.module_list.0.0.conv.1.weight': 'models.28.bn21.weight', 'down3.resblock.module_list.0.0.conv.1.bias': 'models.28.bn21.bias', 
'down3.resblock.module_list.0.0.conv.1.running_mean': 'models.28.bn21.running_mean', 'down3.resblock.module_list.0.0.conv.1.running_var': 'models.28.bn21.running_var', 'down3.resblock.module_list.0.0.conv.1.num_batches_tracked': 'models.28.bn21.num_batches_tracked', 'down3.resblock.module_list.0.1.conv.0.weight': 'models.29.conv22.weight', 'down3.resblock.module_list.0.1.conv.1.weight': 'models.29.bn22.weight', 'down3.resblock.module_list.0.1.conv.1.bias': 'models.29.bn22.bias', 'down3.resblock.module_list.0.1.conv.1.running_mean': 'models.29.bn22.running_mean', 'down3.resblock.module_list.0.1.conv.1.running_var': 'models.29.bn22.running_var', 'down3.resblock.module_list.0.1.conv.1.num_batches_tracked': 'models.29.bn22.num_batches_tracked', 'down3.resblock.module_list.1.0.conv.0.weight': 'models.31.conv23.weight', 'down3.resblock.module_list.1.0.conv.1.weight': 'models.31.bn23.weight', 'down3.resblock.module_list.1.0.conv.1.bias': 'models.31.bn23.bias', 'down3.resblock.module_list.1.0.conv.1.running_mean': 'models.31.bn23.running_mean', 'down3.resblock.module_list.1.0.conv.1.running_var': 'models.31.bn23.running_var', 'down3.resblock.module_list.1.0.conv.1.num_batches_tracked': 'models.31.bn23.num_batches_tracked', 'down3.resblock.module_list.1.1.conv.0.weight': 'models.32.conv24.weight', 'down3.resblock.module_list.1.1.conv.1.weight': 'models.32.bn24.weight', 'down3.resblock.module_list.1.1.conv.1.bias': 'models.32.bn24.bias', 'down3.resblock.module_list.1.1.conv.1.running_mean': 'models.32.bn24.running_mean', 'down3.resblock.module_list.1.1.conv.1.running_var': 'models.32.bn24.running_var', 'down3.resblock.module_list.1.1.conv.1.num_batches_tracked': 'models.32.bn24.num_batches_tracked', 'down3.resblock.module_list.2.0.conv.0.weight': 'models.34.conv25.weight', 'down3.resblock.module_list.2.0.conv.1.weight': 'models.34.bn25.weight', 'down3.resblock.module_list.2.0.conv.1.bias': 'models.34.bn25.bias', 'down3.resblock.module_list.2.0.conv.1.running_mean': 'models.34.bn25.running_mean', 'down3.resblock.module_list.2.0.conv.1.running_var': 'models.34.bn25.running_var', 'down3.resblock.module_list.2.0.conv.1.num_batches_tracked': 'models.34.bn25.num_batches_tracked', 'down3.resblock.module_list.2.1.conv.0.weight': 'models.35.conv26.weight', 'down3.resblock.module_list.2.1.conv.1.weight': 'models.35.bn26.weight', 'down3.resblock.module_list.2.1.conv.1.bias': 'models.35.bn26.bias', 'down3.resblock.module_list.2.1.conv.1.running_mean': 'models.35.bn26.running_mean', 'down3.resblock.module_list.2.1.conv.1.running_var': 'models.35.bn26.running_var', 'down3.resblock.module_list.2.1.conv.1.num_batches_tracked': 'models.35.bn26.num_batches_tracked', 'down3.resblock.module_list.3.0.conv.0.weight': 'models.37.conv27.weight', 'down3.resblock.module_list.3.0.conv.1.weight': 'models.37.bn27.weight', 'down3.resblock.module_list.3.0.conv.1.bias': 'models.37.bn27.bias', 'down3.resblock.module_list.3.0.conv.1.running_mean': 'models.37.bn27.running_mean', 'down3.resblock.module_list.3.0.conv.1.running_var': 'models.37.bn27.running_var', 'down3.resblock.module_list.3.0.conv.1.num_batches_tracked': 'models.37.bn27.num_batches_tracked', 'down3.resblock.module_list.3.1.conv.0.weight': 'models.38.conv28.weight', 'down3.resblock.module_list.3.1.conv.1.weight': 'models.38.bn28.weight', 'down3.resblock.module_list.3.1.conv.1.bias': 'models.38.bn28.bias', 'down3.resblock.module_list.3.1.conv.1.running_mean': 'models.38.bn28.running_mean', 'down3.resblock.module_list.3.1.conv.1.running_var': 
'models.38.bn28.running_var', 'down3.resblock.module_list.3.1.conv.1.num_batches_tracked': 'models.38.bn28.num_batches_tracked', 'down3.resblock.module_list.4.0.conv.0.weight': 'models.40.conv29.weight', 'down3.resblock.module_list.4.0.conv.1.weight': 'models.40.bn29.weight', 'down3.resblock.module_list.4.0.conv.1.bias': 'models.40.bn29.bias', 'down3.resblock.module_list.4.0.conv.1.running_mean': 'models.40.bn29.running_mean', 'down3.resblock.module_list.4.0.conv.1.running_var': 'models.40.bn29.running_var', 'down3.resblock.module_list.4.0.conv.1.num_batches_tracked': 'models.40.bn29.num_batches_tracked', 'down3.resblock.module_list.4.1.conv.0.weight': 'models.41.conv30.weight', 'down3.resblock.module_list.4.1.conv.1.weight': 'models.41.bn30.weight', 'down3.resblock.module_list.4.1.conv.1.bias': 'models.41.bn30.bias', 'down3.resblock.module_list.4.1.conv.1.running_mean': 'models.41.bn30.running_mean', 'down3.resblock.module_list.4.1.conv.1.running_var': 'models.41.bn30.running_var', 'down3.resblock.module_list.4.1.conv.1.num_batches_tracked': 'models.41.bn30.num_batches_tracked', 'down3.resblock.module_list.5.0.conv.0.weight': 'models.43.conv31.weight', 'down3.resblock.module_list.5.0.conv.1.weight': 'models.43.bn31.weight', 'down3.resblock.module_list.5.0.conv.1.bias': 'models.43.bn31.bias', 'down3.resblock.module_list.5.0.conv.1.running_mean': 'models.43.bn31.running_mean', 'down3.resblock.module_list.5.0.conv.1.running_var': 'models.43.bn31.running_var', 'down3.resblock.module_list.5.0.conv.1.num_batches_tracked': 'models.43.bn31.num_batches_tracked', 'down3.resblock.module_list.5.1.conv.0.weight': 'models.44.conv32.weight', 'down3.resblock.module_list.5.1.conv.1.weight': 'models.44.bn32.weight', 'down3.resblock.module_list.5.1.conv.1.bias': 'models.44.bn32.bias', 'down3.resblock.module_list.5.1.conv.1.running_mean': 'models.44.bn32.running_mean', 'down3.resblock.module_list.5.1.conv.1.running_var': 'models.44.bn32.running_var', 'down3.resblock.module_list.5.1.conv.1.num_batches_tracked': 'models.44.bn32.num_batches_tracked', 'down3.resblock.module_list.6.0.conv.0.weight': 'models.46.conv33.weight', 'down3.resblock.module_list.6.0.conv.1.weight': 'models.46.bn33.weight', 'down3.resblock.module_list.6.0.conv.1.bias': 'models.46.bn33.bias', 'down3.resblock.module_list.6.0.conv.1.running_mean': 'models.46.bn33.running_mean', 'down3.resblock.module_list.6.0.conv.1.running_var': 'models.46.bn33.running_var', 'down3.resblock.module_list.6.0.conv.1.num_batches_tracked': 'models.46.bn33.num_batches_tracked', 'down3.resblock.module_list.6.1.conv.0.weight': 'models.47.conv34.weight', 'down3.resblock.module_list.6.1.conv.1.weight': 'models.47.bn34.weight', 'down3.resblock.module_list.6.1.conv.1.bias': 'models.47.bn34.bias', 'down3.resblock.module_list.6.1.conv.1.running_mean': 'models.47.bn34.running_mean', 'down3.resblock.module_list.6.1.conv.1.running_var': 'models.47.bn34.running_var', 'down3.resblock.module_list.6.1.conv.1.num_batches_tracked': 'models.47.bn34.num_batches_tracked', 'down3.resblock.module_list.7.0.conv.0.weight': 'models.49.conv35.weight', 'down3.resblock.module_list.7.0.conv.1.weight': 'models.49.bn35.weight', 'down3.resblock.module_list.7.0.conv.1.bias': 'models.49.bn35.bias', 'down3.resblock.module_list.7.0.conv.1.running_mean': 'models.49.bn35.running_mean', 'down3.resblock.module_list.7.0.conv.1.running_var': 'models.49.bn35.running_var', 'down3.resblock.module_list.7.0.conv.1.num_batches_tracked': 'models.49.bn35.num_batches_tracked', 
'down3.resblock.module_list.7.1.conv.0.weight': 'models.50.conv36.weight', 'down3.resblock.module_list.7.1.conv.1.weight': 'models.50.bn36.weight', 'down3.resblock.module_list.7.1.conv.1.bias': 'models.50.bn36.bias', 'down3.resblock.module_list.7.1.conv.1.running_mean': 'models.50.bn36.running_mean', 'down3.resblock.module_list.7.1.conv.1.running_var': 'models.50.bn36.running_var', 'down3.resblock.module_list.7.1.conv.1.num_batches_tracked': 'models.50.bn36.num_batches_tracked', 'down3.conv4.conv.0.weight': 'models.52.conv37.weight', 'down3.conv4.conv.1.weight': 'models.52.bn37.weight', 'down3.conv4.conv.1.bias': 'models.52.bn37.bias', 'down3.conv4.conv.1.running_mean': 'models.52.bn37.running_mean', 'down3.conv4.conv.1.running_var': 'models.52.bn37.running_var', 'down3.conv4.conv.1.num_batches_tracked': 'models.52.bn37.num_batches_tracked', 'down3.conv5.conv.0.weight': 'models.54.conv38.weight', 'down3.conv5.conv.1.weight': 'models.54.bn38.weight', 'down3.conv5.conv.1.bias': 'models.54.bn38.bias', 'down3.conv5.conv.1.running_mean': 'models.54.bn38.running_mean', 'down3.conv5.conv.1.running_var': 'models.54.bn38.running_var', 'down3.conv5.conv.1.num_batches_tracked': 'models.54.bn38.num_batches_tracked', 'down4.conv1.conv.0.weight': 'models.55.conv39.weight', 'down4.conv1.conv.1.weight': 'models.55.bn39.weight', 'down4.conv1.conv.1.bias': 'models.55.bn39.bias', 'down4.conv1.conv.1.running_mean': 'models.55.bn39.running_mean', 'down4.conv1.conv.1.running_var': 'models.55.bn39.running_var', 'down4.conv1.conv.1.num_batches_tracked': 'models.55.bn39.num_batches_tracked', 'down4.conv2.conv.0.weight': 'models.56.conv40.weight', 'down4.conv2.conv.1.weight': 'models.56.bn40.weight', 'down4.conv2.conv.1.bias': 'models.56.bn40.bias', 'down4.conv2.conv.1.running_mean': 'models.56.bn40.running_mean', 'down4.conv2.conv.1.running_var': 'models.56.bn40.running_var', 'down4.conv2.conv.1.num_batches_tracked': 'models.56.bn40.num_batches_tracked', 'down4.conv3.conv.0.weight': 'models.58.conv41.weight', 'down4.conv3.conv.1.weight': 'models.58.bn41.weight', 'down4.conv3.conv.1.bias': 'models.58.bn41.bias', 'down4.conv3.conv.1.running_mean': 'models.58.bn41.running_mean', 'down4.conv3.conv.1.running_var': 'models.58.bn41.running_var', 'down4.conv3.conv.1.num_batches_tracked': 'models.58.bn41.num_batches_tracked', 'down4.resblock.module_list.0.0.conv.0.weight': 'models.59.conv42.weight', 'down4.resblock.module_list.0.0.conv.1.weight': 'models.59.bn42.weight', 'down4.resblock.module_list.0.0.conv.1.bias': 'models.59.bn42.bias', 'down4.resblock.module_list.0.0.conv.1.running_mean': 'models.59.bn42.running_mean', 'down4.resblock.module_list.0.0.conv.1.running_var': 'models.59.bn42.running_var', 'down4.resblock.module_list.0.0.conv.1.num_batches_tracked': 'models.59.bn42.num_batches_tracked', 'down4.resblock.module_list.0.1.conv.0.weight': 'models.60.conv43.weight', 'down4.resblock.module_list.0.1.conv.1.weight': 'models.60.bn43.weight', 'down4.resblock.module_list.0.1.conv.1.bias': 'models.60.bn43.bias', 'down4.resblock.module_list.0.1.conv.1.running_mean': 'models.60.bn43.running_mean', 'down4.resblock.module_list.0.1.conv.1.running_var': 'models.60.bn43.running_var', 'down4.resblock.module_list.0.1.conv.1.num_batches_tracked': 'models.60.bn43.num_batches_tracked', 'down4.resblock.module_list.1.0.conv.0.weight': 'models.62.conv44.weight', 'down4.resblock.module_list.1.0.conv.1.weight': 'models.62.bn44.weight', 'down4.resblock.module_list.1.0.conv.1.bias': 'models.62.bn44.bias', 
'down4.resblock.module_list.1.0.conv.1.running_mean': 'models.62.bn44.running_mean', 'down4.resblock.module_list.1.0.conv.1.running_var': 'models.62.bn44.running_var', 'down4.resblock.module_list.1.0.conv.1.num_batches_tracked': 'models.62.bn44.num_batches_tracked', 'down4.resblock.module_list.1.1.conv.0.weight': 'models.63.conv45.weight', 'down4.resblock.module_list.1.1.conv.1.weight': 'models.63.bn45.weight', 'down4.resblock.module_list.1.1.conv.1.bias': 'models.63.bn45.bias', 'down4.resblock.module_list.1.1.conv.1.running_mean': 'models.63.bn45.running_mean', 'down4.resblock.module_list.1.1.conv.1.running_var': 'models.63.bn45.running_var', 'down4.resblock.module_list.1.1.conv.1.num_batches_tracked': 'models.63.bn45.num_batches_tracked', 'down4.resblock.module_list.2.0.conv.0.weight': 'models.65.conv46.weight', 'down4.resblock.module_list.2.0.conv.1.weight': 'models.65.bn46.weight', 'down4.resblock.module_list.2.0.conv.1.bias': 'models.65.bn46.bias', 'down4.resblock.module_list.2.0.conv.1.running_mean': 'models.65.bn46.running_mean', 'down4.resblock.module_list.2.0.conv.1.running_var': 'models.65.bn46.running_var', 'down4.resblock.module_list.2.0.conv.1.num_batches_tracked': 'models.65.bn46.num_batches_tracked', 'down4.resblock.module_list.2.1.conv.0.weight': 'models.66.conv47.weight', 'down4.resblock.module_list.2.1.conv.1.weight': 'models.66.bn47.weight', 'down4.resblock.module_list.2.1.conv.1.bias': 'models.66.bn47.bias', 'down4.resblock.module_list.2.1.conv.1.running_mean': 'models.66.bn47.running_mean', 'down4.resblock.module_list.2.1.conv.1.running_var': 'models.66.bn47.running_var', 'down4.resblock.module_list.2.1.conv.1.num_batches_tracked': 'models.66.bn47.num_batches_tracked', 'down4.resblock.module_list.3.0.conv.0.weight': 'models.68.conv48.weight', 'down4.resblock.module_list.3.0.conv.1.weight': 'models.68.bn48.weight', 'down4.resblock.module_list.3.0.conv.1.bias': 'models.68.bn48.bias', 'down4.resblock.module_list.3.0.conv.1.running_mean': 'models.68.bn48.running_mean', 'down4.resblock.module_list.3.0.conv.1.running_var': 'models.68.bn48.running_var', 'down4.resblock.module_list.3.0.conv.1.num_batches_tracked': 'models.68.bn48.num_batches_tracked', 'down4.resblock.module_list.3.1.conv.0.weight': 'models.69.conv49.weight', 'down4.resblock.module_list.3.1.conv.1.weight': 'models.69.bn49.weight', 'down4.resblock.module_list.3.1.conv.1.bias': 'models.69.bn49.bias', 'down4.resblock.module_list.3.1.conv.1.running_mean': 'models.69.bn49.running_mean', 'down4.resblock.module_list.3.1.conv.1.running_var': 'models.69.bn49.running_var', 'down4.resblock.module_list.3.1.conv.1.num_batches_tracked': 'models.69.bn49.num_batches_tracked', 'down4.resblock.module_list.4.0.conv.0.weight': 'models.71.conv50.weight', 'down4.resblock.module_list.4.0.conv.1.weight': 'models.71.bn50.weight', 'down4.resblock.module_list.4.0.conv.1.bias': 'models.71.bn50.bias', 'down4.resblock.module_list.4.0.conv.1.running_mean': 'models.71.bn50.running_mean', 'down4.resblock.module_list.4.0.conv.1.running_var': 'models.71.bn50.running_var', 'down4.resblock.module_list.4.0.conv.1.num_batches_tracked': 'models.71.bn50.num_batches_tracked', 'down4.resblock.module_list.4.1.conv.0.weight': 'models.72.conv51.weight', 'down4.resblock.module_list.4.1.conv.1.weight': 'models.72.bn51.weight', 'down4.resblock.module_list.4.1.conv.1.bias': 'models.72.bn51.bias', 'down4.resblock.module_list.4.1.conv.1.running_mean': 'models.72.bn51.running_mean', 'down4.resblock.module_list.4.1.conv.1.running_var': 
'models.72.bn51.running_var', 'down4.resblock.module_list.4.1.conv.1.num_batches_tracked': 'models.72.bn51.num_batches_tracked', 'down4.resblock.module_list.5.0.conv.0.weight': 'models.74.conv52.weight', 'down4.resblock.module_list.5.0.conv.1.weight': 'models.74.bn52.weight', 'down4.resblock.module_list.5.0.conv.1.bias': 'models.74.bn52.bias', 'down4.resblock.module_list.5.0.conv.1.running_mean': 'models.74.bn52.running_mean', 'down4.resblock.module_list.5.0.conv.1.running_var': 'models.74.bn52.running_var', 'down4.resblock.module_list.5.0.conv.1.num_batches_tracked': 'models.74.bn52.num_batches_tracked', 'down4.resblock.module_list.5.1.conv.0.weight': 'models.75.conv53.weight', 'down4.resblock.module_list.5.1.conv.1.weight': 'models.75.bn53.weight', 'down4.resblock.module_list.5.1.conv.1.bias': 'models.75.bn53.bias', 'down4.resblock.module_list.5.1.conv.1.running_mean': 'models.75.bn53.running_mean', 'down4.resblock.module_list.5.1.conv.1.running_var': 'models.75.bn53.running_var', 'down4.resblock.module_list.5.1.conv.1.num_batches_tracked': 'models.75.bn53.num_batches_tracked', 'down4.resblock.module_list.6.0.conv.0.weight': 'models.77.conv54.weight', 'down4.resblock.module_list.6.0.conv.1.weight': 'models.77.bn54.weight', 'down4.resblock.module_list.6.0.conv.1.bias': 'models.77.bn54.bias', 'down4.resblock.module_list.6.0.conv.1.running_mean': 'models.77.bn54.running_mean', 'down4.resblock.module_list.6.0.conv.1.running_var': 'models.77.bn54.running_var', 'down4.resblock.module_list.6.0.conv.1.num_batches_tracked': 'models.77.bn54.num_batches_tracked', 'down4.resblock.module_list.6.1.conv.0.weight': 'models.78.conv55.weight', 'down4.resblock.module_list.6.1.conv.1.weight': 'models.78.bn55.weight', 'down4.resblock.module_list.6.1.conv.1.bias': 'models.78.bn55.bias', 'down4.resblock.module_list.6.1.conv.1.running_mean': 'models.78.bn55.running_mean', 'down4.resblock.module_list.6.1.conv.1.running_var': 'models.78.bn55.running_var', 'down4.resblock.module_list.6.1.conv.1.num_batches_tracked': 'models.78.bn55.num_batches_tracked', 'down4.resblock.module_list.7.0.conv.0.weight': 'models.80.conv56.weight', 'down4.resblock.module_list.7.0.conv.1.weight': 'models.80.bn56.weight', 'down4.resblock.module_list.7.0.conv.1.bias': 'models.80.bn56.bias', 'down4.resblock.module_list.7.0.conv.1.running_mean': 'models.80.bn56.running_mean', 'down4.resblock.module_list.7.0.conv.1.running_var': 'models.80.bn56.running_var', 'down4.resblock.module_list.7.0.conv.1.num_batches_tracked': 'models.80.bn56.num_batches_tracked', 'down4.resblock.module_list.7.1.conv.0.weight': 'models.81.conv57.weight', 'down4.resblock.module_list.7.1.conv.1.weight': 'models.81.bn57.weight', 'down4.resblock.module_list.7.1.conv.1.bias': 'models.81.bn57.bias', 'down4.resblock.module_list.7.1.conv.1.running_mean': 'models.81.bn57.running_mean', 'down4.resblock.module_list.7.1.conv.1.running_var': 'models.81.bn57.running_var', 'down4.resblock.module_list.7.1.conv.1.num_batches_tracked': 'models.81.bn57.num_batches_tracked', 'down4.conv4.conv.0.weight': 'models.83.conv58.weight', 'down4.conv4.conv.1.weight': 'models.83.bn58.weight', 'down4.conv4.conv.1.bias': 'models.83.bn58.bias', 'down4.conv4.conv.1.running_mean': 'models.83.bn58.running_mean', 'down4.conv4.conv.1.running_var': 'models.83.bn58.running_var', 'down4.conv4.conv.1.num_batches_tracked': 'models.83.bn58.num_batches_tracked', 'down4.conv5.conv.0.weight': 'models.85.conv59.weight', 'down4.conv5.conv.1.weight': 'models.85.bn59.weight', 'down4.conv5.conv.1.bias': 
'models.85.bn59.bias', 'down4.conv5.conv.1.running_mean': 'models.85.bn59.running_mean', 'down4.conv5.conv.1.running_var': 'models.85.bn59.running_var', 'down4.conv5.conv.1.num_batches_tracked': 'models.85.bn59.num_batches_tracked', 'down5.conv1.conv.0.weight': 'models.86.conv60.weight', 'down5.conv1.conv.1.weight': 'models.86.bn60.weight', 'down5.conv1.conv.1.bias': 'models.86.bn60.bias', 'down5.conv1.conv.1.running_mean': 'models.86.bn60.running_mean', 'down5.conv1.conv.1.running_var': 'models.86.bn60.running_var', 'down5.conv1.conv.1.num_batches_tracked': 'models.86.bn60.num_batches_tracked', 'down5.conv2.conv.0.weight': 'models.87.conv61.weight', 'down5.conv2.conv.1.weight': 'models.87.bn61.weight', 'down5.conv2.conv.1.bias': 'models.87.bn61.bias', 'down5.conv2.conv.1.running_mean': 'models.87.bn61.running_mean', 'down5.conv2.conv.1.running_var': 'models.87.bn61.running_var', 'down5.conv2.conv.1.num_batches_tracked': 'models.87.bn61.num_batches_tracked', 'down5.conv3.conv.0.weight': 'models.89.conv62.weight', 'down5.conv3.conv.1.weight': 'models.89.bn62.weight', 'down5.conv3.conv.1.bias': 'models.89.bn62.bias', 'down5.conv3.conv.1.running_mean': 'models.89.bn62.running_mean', 'down5.conv3.conv.1.running_var': 'models.89.bn62.running_var', 'down5.conv3.conv.1.num_batches_tracked': 'models.89.bn62.num_batches_tracked', 'down5.resblock.module_list.0.0.conv.0.weight': 'models.90.conv63.weight', 'down5.resblock.module_list.0.0.conv.1.weight': 'models.90.bn63.weight', 'down5.resblock.module_list.0.0.conv.1.bias': 'models.90.bn63.bias', 'down5.resblock.module_list.0.0.conv.1.running_mean': 'models.90.bn63.running_mean', 'down5.resblock.module_list.0.0.conv.1.running_var': 'models.90.bn63.running_var', 'down5.resblock.module_list.0.0.conv.1.num_batches_tracked': 'models.90.bn63.num_batches_tracked', 'down5.resblock.module_list.0.1.conv.0.weight': 'models.91.conv64.weight', 'down5.resblock.module_list.0.1.conv.1.weight': 'models.91.bn64.weight', 'down5.resblock.module_list.0.1.conv.1.bias': 'models.91.bn64.bias', 'down5.resblock.module_list.0.1.conv.1.running_mean': 'models.91.bn64.running_mean', 'down5.resblock.module_list.0.1.conv.1.running_var': 'models.91.bn64.running_var', 'down5.resblock.module_list.0.1.conv.1.num_batches_tracked': 'models.91.bn64.num_batches_tracked', 'down5.resblock.module_list.1.0.conv.0.weight': 'models.93.conv65.weight', 'down5.resblock.module_list.1.0.conv.1.weight': 'models.93.bn65.weight', 'down5.resblock.module_list.1.0.conv.1.bias': 'models.93.bn65.bias', 'down5.resblock.module_list.1.0.conv.1.running_mean': 'models.93.bn65.running_mean', 'down5.resblock.module_list.1.0.conv.1.running_var': 'models.93.bn65.running_var', 'down5.resblock.module_list.1.0.conv.1.num_batches_tracked': 'models.93.bn65.num_batches_tracked', 'down5.resblock.module_list.1.1.conv.0.weight': 'models.94.conv66.weight', 'down5.resblock.module_list.1.1.conv.1.weight': 'models.94.bn66.weight', 'down5.resblock.module_list.1.1.conv.1.bias': 'models.94.bn66.bias', 'down5.resblock.module_list.1.1.conv.1.running_mean': 'models.94.bn66.running_mean', 'down5.resblock.module_list.1.1.conv.1.running_var': 'models.94.bn66.running_var', 'down5.resblock.module_list.1.1.conv.1.num_batches_tracked': 'models.94.bn66.num_batches_tracked', 'down5.resblock.module_list.2.0.conv.0.weight': 'models.96.conv67.weight', 'down5.resblock.module_list.2.0.conv.1.weight': 'models.96.bn67.weight', 'down5.resblock.module_list.2.0.conv.1.bias': 'models.96.bn67.bias', 'down5.resblock.module_list.2.0.conv.1.running_mean': 
'models.96.bn67.running_mean', 'down5.resblock.module_list.2.0.conv.1.running_var': 'models.96.bn67.running_var', 'down5.resblock.module_list.2.0.conv.1.num_batches_tracked': 'models.96.bn67.num_batches_tracked', 'down5.resblock.module_list.2.1.conv.0.weight': 'models.97.conv68.weight', 'down5.resblock.module_list.2.1.conv.1.weight': 'models.97.bn68.weight', 'down5.resblock.module_list.2.1.conv.1.bias': 'models.97.bn68.bias', 'down5.resblock.module_list.2.1.conv.1.running_mean': 'models.97.bn68.running_mean', 'down5.resblock.module_list.2.1.conv.1.running_var': 'models.97.bn68.running_var', 'down5.resblock.module_list.2.1.conv.1.num_batches_tracked': 'models.97.bn68.num_batches_tracked', 'down5.resblock.module_list.3.0.conv.0.weight': 'models.99.conv69.weight', 'down5.resblock.module_list.3.0.conv.1.weight': 'models.99.bn69.weight', 'down5.resblock.module_list.3.0.conv.1.bias': 'models.99.bn69.bias', 'down5.resblock.module_list.3.0.conv.1.running_mean': 'models.99.bn69.running_mean', 'down5.resblock.module_list.3.0.conv.1.running_var': 'models.99.bn69.running_var', 'down5.resblock.module_list.3.0.conv.1.num_batches_tracked': 'models.99.bn69.num_batches_tracked', 'down5.resblock.module_list.3.1.conv.0.weight': 'models.100.conv70.weight', 'down5.resblock.module_list.3.1.conv.1.weight': 'models.100.bn70.weight', 'down5.resblock.module_list.3.1.conv.1.bias': 'models.100.bn70.bias', 'down5.resblock.module_list.3.1.conv.1.running_mean': 'models.100.bn70.running_mean', 'down5.resblock.module_list.3.1.conv.1.running_var': 'models.100.bn70.running_var', 'down5.resblock.module_list.3.1.conv.1.num_batches_tracked': 'models.100.bn70.num_batches_tracked', 'down5.conv4.conv.0.weight': 'models.102.conv71.weight', 'down5.conv4.conv.1.weight': 'models.102.bn71.weight', 'down5.conv4.conv.1.bias': 'models.102.bn71.bias', 'down5.conv4.conv.1.running_mean': 'models.102.bn71.running_mean', 'down5.conv4.conv.1.running_var': 'models.102.bn71.running_var', 'down5.conv4.conv.1.num_batches_tracked': 'models.102.bn71.num_batches_tracked', 'down5.conv5.conv.0.weight': 'models.104.conv72.weight', 'down5.conv5.conv.1.weight': 'models.104.bn72.weight', 'down5.conv5.conv.1.bias': 'models.104.bn72.bias', 'down5.conv5.conv.1.running_mean': 'models.104.bn72.running_mean', 'down5.conv5.conv.1.running_var': 'models.104.bn72.running_var', 'down5.conv5.conv.1.num_batches_tracked': 'models.104.bn72.num_batches_tracked', 'neek.conv1.conv.0.weight': 'models.105.conv73.weight', 'neek.conv1.conv.1.weight': 'models.105.bn73.weight', 'neek.conv1.conv.1.bias': 'models.105.bn73.bias', 'neek.conv1.conv.1.running_mean': 'models.105.bn73.running_mean', 'neek.conv1.conv.1.running_var': 'models.105.bn73.running_var', 'neek.conv1.conv.1.num_batches_tracked': 'models.105.bn73.num_batches_tracked', 'neek.conv2.conv.0.weight': 'models.106.conv74.weight', 'neek.conv2.conv.1.weight': 'models.106.bn74.weight', 'neek.conv2.conv.1.bias': 'models.106.bn74.bias', 'neek.conv2.conv.1.running_mean': 'models.106.bn74.running_mean', 'neek.conv2.conv.1.running_var': 'models.106.bn74.running_var', 'neek.conv2.conv.1.num_batches_tracked': 'models.106.bn74.num_batches_tracked', 'neek.conv3.conv.0.weight': 'models.107.conv75.weight', 'neek.conv3.conv.1.weight': 'models.107.bn75.weight', 'neek.conv3.conv.1.bias': 'models.107.bn75.bias', 'neek.conv3.conv.1.running_mean': 'models.107.bn75.running_mean', 'neek.conv3.conv.1.running_var': 'models.107.bn75.running_var', 'neek.conv3.conv.1.num_batches_tracked': 'models.107.bn75.num_batches_tracked', 
'neek.conv4.conv.0.weight': 'models.114.conv76.weight', 'neek.conv4.conv.1.weight': 'models.114.bn76.weight', 'neek.conv4.conv.1.bias': 'models.114.bn76.bias', 'neek.conv4.conv.1.running_mean': 'models.114.bn76.running_mean', 'neek.conv4.conv.1.running_var': 'models.114.bn76.running_var', 'neek.conv4.conv.1.num_batches_tracked': 'models.114.bn76.num_batches_tracked', 'neek.conv5.conv.0.weight': 'models.115.conv77.weight', 'neek.conv5.conv.1.weight': 'models.115.bn77.weight', 'neek.conv5.conv.1.bias': 'models.115.bn77.bias', 'neek.conv5.conv.1.running_mean': 'models.115.bn77.running_mean', 'neek.conv5.conv.1.running_var': 'models.115.bn77.running_var', 'neek.conv5.conv.1.num_batches_tracked': 'models.115.bn77.num_batches_tracked', 'neek.conv6.conv.0.weight': 'models.116.conv78.weight', 'neek.conv6.conv.1.weight': 'models.116.bn78.weight', 'neek.conv6.conv.1.bias': 'models.116.bn78.bias', 'neek.conv6.conv.1.running_mean': 'models.116.bn78.running_mean', 'neek.conv6.conv.1.running_var': 'models.116.bn78.running_var', 'neek.conv6.conv.1.num_batches_tracked': 'models.116.bn78.num_batches_tracked', 'neek.conv7.conv.0.weight': 'models.117.conv79.weight', 'neek.conv7.conv.1.weight': 'models.117.bn79.weight', 'neek.conv7.conv.1.bias': 'models.117.bn79.bias', 'neek.conv7.conv.1.running_mean': 'models.117.bn79.running_mean', 'neek.conv7.conv.1.running_var': 'models.117.bn79.running_var', 'neek.conv7.conv.1.num_batches_tracked': 'models.117.bn79.num_batches_tracked', 'neek.conv8.conv.0.weight': 'models.120.conv80.weight', 'neek.conv8.conv.1.weight': 'models.120.bn80.weight', 'neek.conv8.conv.1.bias': 'models.120.bn80.bias', 'neek.conv8.conv.1.running_mean': 'models.120.bn80.running_mean', 'neek.conv8.conv.1.running_var': 'models.120.bn80.running_var', 'neek.conv8.conv.1.num_batches_tracked': 'models.120.bn80.num_batches_tracked', 'neek.conv9.conv.0.weight': 'models.122.conv81.weight', 'neek.conv9.conv.1.weight': 'models.122.bn81.weight', 'neek.conv9.conv.1.bias': 'models.122.bn81.bias', 'neek.conv9.conv.1.running_mean': 'models.122.bn81.running_mean', 'neek.conv9.conv.1.running_var': 'models.122.bn81.running_var', 'neek.conv9.conv.1.num_batches_tracked': 'models.122.bn81.num_batches_tracked', 'neek.conv10.conv.0.weight': 'models.123.conv82.weight', 'neek.conv10.conv.1.weight': 'models.123.bn82.weight', 'neek.conv10.conv.1.bias': 'models.123.bn82.bias', 'neek.conv10.conv.1.running_mean': 'models.123.bn82.running_mean', 'neek.conv10.conv.1.running_var': 'models.123.bn82.running_var', 'neek.conv10.conv.1.num_batches_tracked': 'models.123.bn82.num_batches_tracked', 'neek.conv11.conv.0.weight': 'models.124.conv83.weight', 'neek.conv11.conv.1.weight': 'models.124.bn83.weight', 'neek.conv11.conv.1.bias': 'models.124.bn83.bias', 'neek.conv11.conv.1.running_mean': 'models.124.bn83.running_mean', 'neek.conv11.conv.1.running_var': 'models.124.bn83.running_var', 'neek.conv11.conv.1.num_batches_tracked': 'models.124.bn83.num_batches_tracked', 'neek.conv12.conv.0.weight': 'models.125.conv84.weight', 'neek.conv12.conv.1.weight': 'models.125.bn84.weight', 'neek.conv12.conv.1.bias': 'models.125.bn84.bias', 'neek.conv12.conv.1.running_mean': 'models.125.bn84.running_mean', 'neek.conv12.conv.1.running_var': 'models.125.bn84.running_var', 'neek.conv12.conv.1.num_batches_tracked': 'models.125.bn84.num_batches_tracked', 'neek.conv13.conv.0.weight': 'models.126.conv85.weight', 'neek.conv13.conv.1.weight': 'models.126.bn85.weight', 'neek.conv13.conv.1.bias': 'models.126.bn85.bias', 'neek.conv13.conv.1.running_mean': 
'models.126.bn85.running_mean', 'neek.conv13.conv.1.running_var': 'models.126.bn85.running_var', 'neek.conv13.conv.1.num_batches_tracked': 'models.126.bn85.num_batches_tracked', 'neek.conv14.conv.0.weight': 'models.127.conv86.weight', 'neek.conv14.conv.1.weight': 'models.127.bn86.weight', 'neek.conv14.conv.1.bias': 'models.127.bn86.bias', 'neek.conv14.conv.1.running_mean': 'models.127.bn86.running_mean', 'neek.conv14.conv.1.running_var': 'models.127.bn86.running_var', 'neek.conv14.conv.1.num_batches_tracked': 'models.127.bn86.num_batches_tracked', 'neek.conv15.conv.0.weight': 'models.130.conv87.weight', 'neek.conv15.conv.1.weight': 'models.130.bn87.weight', 'neek.conv15.conv.1.bias': 'models.130.bn87.bias', 'neek.conv15.conv.1.running_mean': 'models.130.bn87.running_mean', 'neek.conv15.conv.1.running_var': 'models.130.bn87.running_var', 'neek.conv15.conv.1.num_batches_tracked': 'models.130.bn87.num_batches_tracked', 'neek.conv16.conv.0.weight': 'models.132.conv88.weight', 'neek.conv16.conv.1.weight': 'models.132.bn88.weight', 'neek.conv16.conv.1.bias': 'models.132.bn88.bias', 'neek.conv16.conv.1.running_mean': 'models.132.bn88.running_mean', 'neek.conv16.conv.1.running_var': 'models.132.bn88.running_var', 'neek.conv16.conv.1.num_batches_tracked': 'models.132.bn88.num_batches_tracked', 'neek.conv17.conv.0.weight': 'models.133.conv89.weight', 'neek.conv17.conv.1.weight': 'models.133.bn89.weight', 'neek.conv17.conv.1.bias': 'models.133.bn89.bias', 'neek.conv17.conv.1.running_mean': 'models.133.bn89.running_mean', 'neek.conv17.conv.1.running_var': 'models.133.bn89.running_var', 'neek.conv17.conv.1.num_batches_tracked': 'models.133.bn89.num_batches_tracked', 'neek.conv18.conv.0.weight': 'models.134.conv90.weight', 'neek.conv18.conv.1.weight': 'models.134.bn90.weight', 'neek.conv18.conv.1.bias': 'models.134.bn90.bias', 'neek.conv18.conv.1.running_mean': 'models.134.bn90.running_mean', 'neek.conv18.conv.1.running_var': 'models.134.bn90.running_var', 'neek.conv18.conv.1.num_batches_tracked': 'models.134.bn90.num_batches_tracked', 'neek.conv19.conv.0.weight': 'models.135.conv91.weight', 'neek.conv19.conv.1.weight': 'models.135.bn91.weight', 'neek.conv19.conv.1.bias': 'models.135.bn91.bias', 'neek.conv19.conv.1.running_mean': 'models.135.bn91.running_mean', 'neek.conv19.conv.1.running_var': 'models.135.bn91.running_var', 'neek.conv19.conv.1.num_batches_tracked': 'models.135.bn91.num_batches_tracked', 'neek.conv20.conv.0.weight': 'models.136.conv92.weight', 'neek.conv20.conv.1.weight': 'models.136.bn92.weight', 'neek.conv20.conv.1.bias': 'models.136.bn92.bias', 'neek.conv20.conv.1.running_mean': 'models.136.bn92.running_mean', 'neek.conv20.conv.1.running_var': 'models.136.bn92.running_var', 'neek.conv20.conv.1.num_batches_tracked': 'models.136.bn92.num_batches_tracked', 'head.conv1.conv.0.weight': 'models.137.conv93.weight', 'head.conv1.conv.1.weight': 'models.137.bn93.weight', 'head.conv1.conv.1.bias': 'models.137.bn93.bias', 'head.conv1.conv.1.running_mean': 'models.137.bn93.running_mean', 'head.conv1.conv.1.running_var': 'models.137.bn93.running_var', 'head.conv1.conv.1.num_batches_tracked': 'models.137.bn93.num_batches_tracked', 'head.conv2.conv.0.weight': 'models.138.conv94.weight', 'head.conv2.conv.0.bias': 'models.138.conv94.bias', 'head.conv3.conv.0.weight': 'models.141.conv95.weight', 'head.conv3.conv.1.weight': 'models.141.bn95.weight', 'head.conv3.conv.1.bias': 'models.141.bn95.bias', 'head.conv3.conv.1.running_mean': 'models.141.bn95.running_mean', 'head.conv3.conv.1.running_var': 
'models.141.bn95.running_var', 'head.conv3.conv.1.num_batches_tracked': 'models.141.bn95.num_batches_tracked', 'head.conv4.conv.0.weight': 'models.143.conv96.weight', 'head.conv4.conv.1.weight': 'models.143.bn96.weight', 'head.conv4.conv.1.bias': 'models.143.bn96.bias', 'head.conv4.conv.1.running_mean': 'models.143.bn96.running_mean', 'head.conv4.conv.1.running_var': 'models.143.bn96.running_var', 'head.conv4.conv.1.num_batches_tracked': 'models.143.bn96.num_batches_tracked', 'head.conv5.conv.0.weight': 'models.144.conv97.weight', 'head.conv5.conv.1.weight': 'models.144.bn97.weight', 'head.conv5.conv.1.bias': 'models.144.bn97.bias', 'head.conv5.conv.1.running_mean': 'models.144.bn97.running_mean', 'head.conv5.conv.1.running_var': 'models.144.bn97.running_var', 'head.conv5.conv.1.num_batches_tracked': 'models.144.bn97.num_batches_tracked', 'head.conv6.conv.0.weight': 'models.145.conv98.weight', 'head.conv6.conv.1.weight': 'models.145.bn98.weight', 'head.conv6.conv.1.bias': 'models.145.bn98.bias', 'head.conv6.conv.1.running_mean': 'models.145.bn98.running_mean', 'head.conv6.conv.1.running_var': 'models.145.bn98.running_var', 'head.conv6.conv.1.num_batches_tracked': 'models.145.bn98.num_batches_tracked', 'head.conv7.conv.0.weight': 'models.146.conv99.weight', 'head.conv7.conv.1.weight': 'models.146.bn99.weight', 'head.conv7.conv.1.bias': 'models.146.bn99.bias', 'head.conv7.conv.1.running_mean': 'models.146.bn99.running_mean', 'head.conv7.conv.1.running_var': 'models.146.bn99.running_var', 'head.conv7.conv.1.num_batches_tracked': 'models.146.bn99.num_batches_tracked', 'head.conv8.conv.0.weight': 'models.147.conv100.weight', 'head.conv8.conv.1.weight': 'models.147.bn100.weight', 'head.conv8.conv.1.bias': 'models.147.bn100.bias', 'head.conv8.conv.1.running_mean': 'models.147.bn100.running_mean', 'head.conv8.conv.1.running_var': 'models.147.bn100.running_var', 'head.conv8.conv.1.num_batches_tracked': 'models.147.bn100.num_batches_tracked', 'head.conv9.conv.0.weight': 'models.148.conv101.weight', 'head.conv9.conv.1.weight': 'models.148.bn101.weight', 'head.conv9.conv.1.bias': 'models.148.bn101.bias', 'head.conv9.conv.1.running_mean': 'models.148.bn101.running_mean', 'head.conv9.conv.1.running_var': 'models.148.bn101.running_var', 'head.conv9.conv.1.num_batches_tracked': 'models.148.bn101.num_batches_tracked', 'head.conv10.conv.0.weight': 'models.149.conv102.weight', 'head.conv10.conv.0.bias': 'models.149.conv102.bias', 'head.conv11.conv.0.weight': 'models.152.conv103.weight', 'head.conv11.conv.1.weight': 'models.152.bn103.weight', 'head.conv11.conv.1.bias': 'models.152.bn103.bias', 'head.conv11.conv.1.running_mean': 'models.152.bn103.running_mean', 'head.conv11.conv.1.running_var': 'models.152.bn103.running_var', 'head.conv11.conv.1.num_batches_tracked': 'models.152.bn103.num_batches_tracked', 'head.conv12.conv.0.weight': 'models.154.conv104.weight', 'head.conv12.conv.1.weight': 'models.154.bn104.weight', 'head.conv12.conv.1.bias': 'models.154.bn104.bias', 'head.conv12.conv.1.running_mean': 'models.154.bn104.running_mean', 'head.conv12.conv.1.running_var': 'models.154.bn104.running_var', 'head.conv12.conv.1.num_batches_tracked': 'models.154.bn104.num_batches_tracked', 'head.conv13.conv.0.weight': 'models.155.conv105.weight', 'head.conv13.conv.1.weight': 'models.155.bn105.weight', 'head.conv13.conv.1.bias': 'models.155.bn105.bias', 'head.conv13.conv.1.running_mean': 'models.155.bn105.running_mean', 'head.conv13.conv.1.running_var': 'models.155.bn105.running_var', 
'head.conv13.conv.1.num_batches_tracked': 'models.155.bn105.num_batches_tracked', 'head.conv14.conv.0.weight': 'models.156.conv106.weight', 'head.conv14.conv.1.weight': 'models.156.bn106.weight', 'head.conv14.conv.1.bias': 'models.156.bn106.bias', 'head.conv14.conv.1.running_mean': 'models.156.bn106.running_mean', 'head.conv14.conv.1.running_var': 'models.156.bn106.running_var', 'head.conv14.conv.1.num_batches_tracked': 'models.156.bn106.num_batches_tracked', 'head.conv15.conv.0.weight': 'models.157.conv107.weight', 'head.conv15.conv.1.weight': 'models.157.bn107.weight', 'head.conv15.conv.1.bias': 'models.157.bn107.bias', 'head.conv15.conv.1.running_mean': 'models.157.bn107.running_mean', 'head.conv15.conv.1.running_var': 'models.157.bn107.running_var', 'head.conv15.conv.1.num_batches_tracked': 'models.157.bn107.num_batches_tracked', 'head.conv16.conv.0.weight': 'models.158.conv108.weight', 'head.conv16.conv.1.weight': 'models.158.bn108.weight', 'head.conv16.conv.1.bias': 'models.158.bn108.bias', 'head.conv16.conv.1.running_mean': 'models.158.bn108.running_mean', 'head.conv16.conv.1.running_var': 'models.158.bn108.running_var', 'head.conv16.conv.1.num_batches_tracked': 'models.158.bn108.num_batches_tracked', 'head.conv17.conv.0.weight': 'models.159.conv109.weight', 'head.conv17.conv.1.weight': 'models.159.bn109.weight', 'head.conv17.conv.1.bias': 'models.159.bn109.bias', 'head.conv17.conv.1.running_mean': 'models.159.bn109.running_mean', 'head.conv17.conv.1.running_var': 'models.159.bn109.running_var', 'head.conv17.conv.1.num_batches_tracked': 'models.159.bn109.num_batches_tracked', 'head.conv18.conv.0.weight': 'models.160.conv110.weight', 'head.conv18.conv.0.bias': 'models.160.conv110.bias',
    }
    pth_weights = torch.load(checkpoint)
    pt_weights = type(pth_weights)()
    for name, new_name in name_mapping.items():
        pt_weights[new_name] = pth_weights[name]
    return pt_weights


def convert_pt_checkpoint_to_keras_h5(state_dict):
    print('============================================================')

    def copy1(conv, bn, idx):
        keyword1 = 'conv%d.weight' % idx
        keyword2 = 'bn%d.weight' % idx
        keyword3 = 'bn%d.bias' % idx
        keyword4 = 'bn%d.running_mean' % idx
        keyword5 = 'bn%d.running_var' % idx
        for key in state_dict:
            value = state_dict[key].numpy()
            if keyword1 in key:
                w = value
            elif keyword2 in key:
                y = value
            elif keyword3 in key:
                b = value
            elif keyword4 in key:
                m = value
            elif keyword5 in key:
                v = value
        w = w.transpose(2, 3, 1, 0)
        conv.set_weights([w])
        bn.set_weights([y, b, m, v])

    def copy2(conv, idx):
        keyword1 = 'conv%d.weight' % idx
        keyword2 = 'conv%d.bias' % idx
        for key in state_dict:
            value = state_dict[key].numpy()
            if keyword1 in key:
                w = value
            elif keyword2 in key:
                b = value
        w = w.transpose(2, 3, 1, 0)
        conv.set_weights([w, b])

    num_classes = 80
    num_anchors = 3
    with tf.Session(graph=tf.Graph()):
        inputs = layers.Input(shape=[], dtype='string')
        model_body = YOLOv4(inputs, num_classes, num_anchors)
        model_body.summary()
        layer_name_to_idx = {layer.name: idx for idx, layer in enumerate(model_body.layers)}
        print('\nCopying...')
        i1 = layer_name_to_idx['conv2d']
        i2 = layer_name_to_idx['batch_normalization']
        copy1(model_body.layers[i1], model_body.layers[i2], 1)
        for i in range(2, 94, 1):
            i1 = layer_name_to_idx['conv2d_%d' % (i - 1)]
            i2 = layer_name_to_idx['batch_normalization_%d' % (i - 1)]
            copy1(model_body.layers[i1], model_body.layers[i2], i)
        for i in range(95, 102, 1):
            i1 = layer_name_to_idx['conv2d_%d' % (i - 1)]
            i2 = layer_name_to_idx['batch_normalization_%d' % (i - 2)]
            copy1(model_body.layers[i1], model_body.layers[i2], i)
        for i in range(103, 110, 1):
            i1 = layer_name_to_idx['conv2d_%d' % (i - 1)]
            i2 = layer_name_to_idx['batch_normalization_%d' % (i - 3)]
            copy1(model_body.layers[i1], model_body.layers[i2], i)
        i1 = layer_name_to_idx['conv2d_93']
        copy2(model_body.layers[i1], 94)
        i1 = layer_name_to_idx['conv2d_101']
        copy2(model_body.layers[i1], 102)
        i1 = layer_name_to_idx['conv2d_109']
        copy2(model_body.layers[i1], 110)
        weights = model_body.get_weights()
        print('\nDone.')
        return weights


class Mish(layers.Layer):
    def __init__(self):
        super(Mish, self).__init__()

    def compute_output_shape(self, input_shape):
        return input_shape

    def call(self, x):
        return x * tf.tanh(tf.math.softplus(x))


def conv2d_unit(x, filters, kernels, strides=1, padding='valid', bn=1, act='mish'):
    use_bias = (bn != 1)
    x = layers.Conv2D(filters, kernels,
                      padding=padding,
                      strides=strides,
                      use_bias=use_bias,
                      activation='linear',
                      kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.01))(x)
    if bn:
        x = layers.BatchNormalization(fused=False)(x)
    if act == 'leaky':
        x = keras.layers.LeakyReLU(alpha=0.1)(x)
    elif act == 'mish':
        x = Mish()(x)
    return x


def residual_block(inputs, filters_1, filters_2):
    x = conv2d_unit(inputs, filters_1, 1, strides=1, padding='valid')
    x = conv2d_unit(x, filters_2, 3, strides=1, padding='same')
    x = layers.add([inputs, x])
    return x


def stack_residual_block(inputs, filters_1, filters_2, n):
    x = residual_block(inputs, filters_1, filters_2)
    for i in range(n - 1):
        x = residual_block(x, filters_1, filters_2)
    return x


def spp(x):
    x_1 = x
    x_2 = layers.MaxPooling2D(pool_size=5, strides=1, padding='same')(x)
    x_3 = layers.MaxPooling2D(pool_size=9, strides=1, padding='same')(x)
    x_4 = layers.MaxPooling2D(pool_size=13, strides=1, padding='same')(x)
    out = layers.Concatenate()([x_4, x_3, x_2, x_1])
    return out
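
# Illustrative only: the three builders above compose in the fixed pattern that
# the backbone below repeats at every scale (a sketch; `feat` stands for any
# NHWC feature map already in the graph):
#
#   feat = conv2d_unit(feat, 64, 3, strides=1, padding='same')  # Conv + BN + Mish
#   feat = stack_residual_block(feat, 32, 64, n=1)              # residual stage
#   feat = spp(feat)                                            # 5/9/13 max-pool pyramid, 4x channels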

def YOLOv4(inputs, num_classes, num_anchors, input_shape=(608, 608),
           initial_filters=32, fast=False, anchors=None,
           conf_thresh=0.05, nms_thresh=0.45, keep_top_k=100, nms_top_k=100):
    i32 = initial_filters
    i64 = i32 * 2
    i128 = i32 * 4
    i256 = i32 * 8
    i512 = i32 * 16
    i1024 = i32 * 32
    x, image_shape = layers.Lambda(lambda t: preprocessor(t, input_shape))(inputs)

    # cspdarknet53
    x = conv2d_unit(x, i32, 3, strides=1, padding='same')

    # ============================= s2 =============================
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(x)
    x = conv2d_unit(x, i64, 3, strides=2)
    s2 = conv2d_unit(x, i64, 1, strides=1)
    x = conv2d_unit(x, i64, 1, strides=1)
    x = stack_residual_block(x, i32, i64, n=1)
    x = conv2d_unit(x, i64, 1, strides=1)
    x = layers.Concatenate()([x, s2])
    s2 = conv2d_unit(x, i64, 1, strides=1)

    # ============================= s4 =============================
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(s2)
    x = conv2d_unit(x, i128, 3, strides=2)
    s4 = conv2d_unit(x, i64, 1, strides=1)
    x = conv2d_unit(x, i64, 1, strides=1)
    x = stack_residual_block(x, i64, i64, n=2)
    x = conv2d_unit(x, i64, 1, strides=1)
    x = layers.Concatenate()([x, s4])
    s4 = conv2d_unit(x, i128, 1, strides=1)

    # ============================= s8 =============================
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(s4)
    x = conv2d_unit(x, i256, 3, strides=2)
    s8 = conv2d_unit(x, i128, 1, strides=1)
    x = conv2d_unit(x, i128, 1, strides=1)
    x = stack_residual_block(x, i128, i128, n=8)
    x = conv2d_unit(x, i128, 1, strides=1)
    x = layers.Concatenate()([x, s8])
    s8 = conv2d_unit(x, i256, 1, strides=1)

    # ============================= s16 =============================
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(s8)
    x = conv2d_unit(x, i512, 3, strides=2)
    s16 = conv2d_unit(x, i256, 1, strides=1)
    x = conv2d_unit(x, i256, 1, strides=1)
    x = stack_residual_block(x, i256, i256, n=8)
    x = conv2d_unit(x, i256, 1, strides=1)
    x = layers.Concatenate()([x, s16])
    s16 = conv2d_unit(x, i512, 1, strides=1)

    # ============================= s32 =============================
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(s16)
    x = conv2d_unit(x, i1024, 3, strides=2)
    s32 = conv2d_unit(x, i512, 1, strides=1)
    x = conv2d_unit(x, i512, 1, strides=1)
    x = stack_residual_block(x, i512, i512, n=4)
    x = conv2d_unit(x, i512, 1, strides=1)
    x = layers.Concatenate()([x, s32])
    s32 = conv2d_unit(x, i1024, 1, strides=1)

    # fpn
    x = conv2d_unit(s32, i512, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i1024, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i512, 1, strides=1, act='leaky')
    x = spp(x)
    x = conv2d_unit(x, i512, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i1024, 3, strides=1, padding='same', act='leaky')
    fpn_s32 = conv2d_unit(x, i512, 1, strides=1, act='leaky')

    # pan01
    x = conv2d_unit(fpn_s32, i256, 1, strides=1, act='leaky')
    x = layers.UpSampling2D(2)(x)
    s16 = conv2d_unit(s16, i256, 1, strides=1, act='leaky')
    x = layers.Concatenate()([s16, x])
    x = conv2d_unit(x, i256, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i512, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i256, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i512, 3, strides=1, padding='same', act='leaky')
    fpn_s16 = conv2d_unit(x, i256, 1, strides=1, act='leaky')

    # pan02
    x = conv2d_unit(fpn_s16, i128, 1, strides=1, act='leaky')
    x = layers.UpSampling2D(2)(x)
    s8 = conv2d_unit(s8, i128, 1, strides=1, act='leaky')
    x = layers.Concatenate()([s8, x])
    x = conv2d_unit(x, i128, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i256, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i128, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i256, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i128, 1, strides=1, act='leaky')

    # output_s, doesn't need concat()
    output_s = conv2d_unit(x, i256, 3, strides=1, padding='same', act='leaky')
    output_s = conv2d_unit(output_s, num_anchors * (num_classes + 5), 1, strides=1, bn=0, act=None)

    # output_m, need concat()
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(x)
    x = conv2d_unit(x, i256, 3, strides=2, act='leaky')
    x = layers.Concatenate()([x, fpn_s16])
    x = conv2d_unit(x, i256, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i512, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i256, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i512, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i256, 1, strides=1, act='leaky')
    output_m = conv2d_unit(x, i512, 3, strides=1, padding='same', act='leaky')
    output_m = conv2d_unit(output_m, num_anchors * (num_classes + 5), 1, strides=1, bn=0, act=None)

    # output_l, need concat()
    x = layers.ZeroPadding2D(padding=((1, 0), (1, 0)))(x)
    x = conv2d_unit(x, i512, 3, strides=2, act='leaky')
    x = layers.Concatenate()([x, fpn_s32])
    x = conv2d_unit(x, i512, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i1024, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i512, 1, strides=1, act='leaky')
    x = conv2d_unit(x, i1024, 3, strides=1, padding='same', act='leaky')
    x = conv2d_unit(x, i512, 1, strides=1, act='leaky')
    output_l = conv2d_unit(x, i1024, 3, strides=1, padding='same', act='leaky')
    output_l = conv2d_unit(output_l, num_anchors * (num_classes + 5), 1, strides=1, bn=0, act=None)

    def cast_float32(tensor):
        return tf.cast(tensor, tf.float32)

    output_l = layers.Lambda(cast_float32)(output_l)
    output_m = layers.Lambda(cast_float32)(output_m)
    output_s = layers.Lambda(cast_float32)(output_s)

    # originally reshape in multi_thread_post
    output_lr = layers.Reshape((1, input_shape[0] // 32, input_shape[1] // 32, 3, 5 + num_classes))(output_l)
    output_mr = layers.Reshape((1, input_shape[0] // 16, input_shape[1] // 16, 3, 5 + num_classes))(output_m)
    output_sr = layers.Reshape((1, input_shape[0] // 8, input_shape[1] // 8, 3, 5 + num_classes))(output_s)

    # originally _yolo_out
    masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
    anchors = [[12, 16], [19, 36], [40, 28], [36, 75], [76, 55], [72, 146],
               [142, 110], [192, 243], [459, 401]]

    def batch_process_feats(out, anchors, mask):
        grid_h, grid_w, num_boxes = map(int, out.shape[2:5])
        anchors = [anchors[i] for i in mask]
        anchors_tensor = np.array(anchors).reshape(1, 1, len(anchors), 2)
        # Reshape to batch, height, width, num_anchors, box_params.
        box_xy = tf.sigmoid(out[..., :2])
        box_wh = tf.exp(out[..., 2:4])
        box_wh = box_wh * anchors_tensor
        box_confidence = tf.sigmoid(out[..., 4])
        box_confidence = tf.expand_dims(box_confidence, axis=-1)
        box_class_probs = tf.sigmoid(out[..., 5:])
        col = np.tile(np.arange(0, grid_w), grid_w).reshape(-1, grid_w)
        row = np.tile(np.arange(0, grid_h).reshape(-1, 1), grid_h)
        col = col.reshape(grid_h, grid_w, 1, 1).repeat(3, axis=-2)
        row = row.reshape(grid_h, grid_w, 1, 1).repeat(3, axis=-2)
        grid = np.concatenate((col, row), axis=-1).astype(np.float32)
        box_xy += grid
        box_xy /= (grid_w, grid_h)
        box_wh /= input_shape
        box_xy -= (box_wh / 2.)  # normalized xywh
        boxes = tf.concat((box_xy, box_xy + box_wh), axis=-1)
        box_scores = box_confidence * box_class_probs
        num_boxes = np.prod(boxes.shape[1:-1])
        boxes = tf.reshape(boxes, [-1, num_boxes, boxes.shape[-1]])
        box_scores = tf.reshape(box_scores, [-1, num_boxes, box_scores.shape[-1]])
        return boxes, box_scores
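
    # The decode above is the standard YOLO box transform: for cell offset
    # (cx, cy), anchor (pw, ph), and raw outputs (tx, ty, tw, th),
    #
    #   bx = (sigmoid(tx) + cx) / grid_w     by = (sigmoid(ty) + cy) / grid_h
    #   bw = pw * exp(tw) / input_w          bh = ph * exp(th) / input_h
    #
    # and boxes are stored as corners (x_min, y_min, x_max, y_max) in normalized
    # image coordinates, hence the -box_wh / 2 shift before the corner concat.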

    def filter_boxes(outputs):
        boxes_l, boxes_m, boxes_s, box_scores_l, box_scores_m, box_scores_s, image_shape = outputs
        boxes_l, box_scores_l = filter_boxes_one_size(boxes_l, box_scores_l)
        boxes_m, box_scores_m = filter_boxes_one_size(boxes_m, box_scores_m)
        boxes_s, box_scores_s = filter_boxes_one_size(boxes_s, box_scores_s)
        boxes = tf.concat([boxes_l, boxes_m, boxes_s], axis=0)
        box_scores = tf.concat([box_scores_l, box_scores_m, box_scores_s], axis=0)
        image_shape_wh = image_shape[1::-1]
        image_shape_whwh = tf.concat([image_shape_wh, image_shape_wh], axis=-1)
        image_shape_whwh = tf.cast(image_shape_whwh, tf.float32)
        boxes *= image_shape_whwh
        boxes = tf.expand_dims(boxes, 0)
        box_scores = tf.expand_dims(box_scores, 0)
        boxes = tf.expand_dims(boxes, 2)
        nms_boxes, nms_scores, nms_classes, valid_detections = tf.image.combined_non_max_suppression(
            boxes,
            box_scores,
            max_output_size_per_class=nms_top_k,
            max_total_size=nms_top_k,
            iou_threshold=nms_thresh,
            score_threshold=conf_thresh,
            pad_per_class=False,
            clip_boxes=False,
            name='CombinedNonMaxSuppression',
        )
        return nms_boxes[0], nms_scores[0], nms_classes[0]

    def filter_boxes_one_size(boxes, box_scores):
        box_class_scores = tf.reduce_max(box_scores, axis=-1)
        keep = box_class_scores > conf_thresh
        boxes = boxes[keep]
        box_scores = box_scores[keep]
        return boxes, box_scores

    def batch_yolo_out(outputs):
        with tf.name_scope('yolo_out'):
            b_output_lr, b_output_mr, b_output_sr, b_image_shape = outputs
            with tf.name_scope('process_feats'):
                b_boxes_l, b_box_scores_l = batch_process_feats(b_output_lr, anchors, masks[0])
            with tf.name_scope('process_feats'):
                b_boxes_m, b_box_scores_m = batch_process_feats(b_output_mr, anchors, masks[1])
            with tf.name_scope('process_feats'):
                b_boxes_s, b_box_scores_s = batch_process_feats(b_output_sr, anchors, masks[2])
            with tf.name_scope('filter_boxes'):
                b_nms_boxes, b_nms_scores, b_nms_classes = tf.map_fn(
                    filter_boxes,
                    [b_boxes_l, b_boxes_m, b_boxes_s,
                     b_box_scores_l, b_box_scores_m, b_box_scores_s, b_image_shape],
                    dtype=(tf.float32, tf.float32, tf.float32),
                    back_prop=False, parallel_iterations=16)
        return b_nms_boxes, b_nms_scores, b_nms_classes

    boxes_scores_classes = layers.Lambda(batch_yolo_out)([output_lr, output_mr, output_sr, image_shape])
    model_body = keras.models.Model(inputs=inputs, outputs=boxes_scores_classes)
    return model_body
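
# The exported graph takes encoded image bytes (dtype tf.string) as input and
# performs decode + resize on-graph via the two helpers below. A feed for one
# image could look like this sketch (file name hypothetical; `yolo` and `sess`
# follow main() further down):
#
#   with open('test.jpg', 'rb') as f:
#       image_bytes = f.read()
#   boxes, scores, classes = sess.run(yolo.outputs, {yolo.inputs[0]: [image_bytes]})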

def decode_jpeg_resize(input_tensor, image_size):
    tensor = tf.image.decode_png(input_tensor, channels=3)
    shape = tf.shape(tensor)
    tensor = tf.cast(tensor, tf.float32)
    tensor = tf.image.resize(tensor, image_size)
    tensor /= 255.0
    return tf.cast(tensor, tf.float16), shape


def preprocessor(input_tensor, image_size):
    with tf.name_scope('Preprocessor'):
        tensor = tf.map_fn(
            partial(decode_jpeg_resize, image_size=image_size), input_tensor,
            dtype=(tf.float16, tf.int32),
            back_prop=False, parallel_iterations=16)
    return tensor


def main():
    os.system('aws s3 cp s3://neuron-s3/training_checkpoints/pytorch/yolov4/yolov4.pth . --no-sign-request')
    torch_weights = rename_weights('./yolov4.pth')
    keras_weights = convert_pt_checkpoint_to_keras_h5(torch_weights)
    keras.backend.set_learning_phase(0)
    num_anchors = 3
    num_classes = 80
    input_shape = (608, 608)
    conf_thresh = 0.001
    nms_thresh = 0.45
    inputs = layers.Input(shape=[], dtype='string')
    yolo = YOLOv4(inputs, num_classes, num_anchors, input_shape,
                  conf_thresh=conf_thresh, nms_thresh=nms_thresh)
    yolo.set_weights(keras_weights)
    sess = keras.backend.get_session()
    inputs = {'image': yolo.inputs[0]}
    output_names = ['boxes', 'scores', 'classes']
    outputs = {name: ts for name, ts in zip(output_names, yolo.outputs)}
    tf.saved_model.simple_save(sess, './yolo_v4_coco_saved_model', inputs, outputs)


if __name__ == '__main__':
    main()
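
# Once exported, the SavedModel can be reloaded for inference in a fresh TF1
# session; a minimal sketch (the 'serve' tag is what simple_save writes):
#
#   with tf.Session(graph=tf.Graph()) as sess:
#       tf.saved_model.loader.load(sess, ['serve'], './yolo_v4_coco_saved_model')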

================================================
FILE: src/helperscripts/installationScripts/python_instructions.txt
================================================
# AL2 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# U20 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# AL2 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx Upgrade(1.13)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(1.13)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx Upgrade(1.12)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.12.0 --neuron-version=2.6.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(1.12)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.12.0 --neuron-version=2.6.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx Upgrade(1.11)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --neuron-version=2.4.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(1.11)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.11.0 --neuron-version=2.4.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 tensorflow Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 EFA Installation
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=efa --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 EFA Installation
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=efa --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 PyTorch DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework
# U20 PyTorch DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework
# AL2 tensorflow Neuronx upgrade(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx upgrade(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx upgrade(2.9)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx upgrade(2.9)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx upgrade(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx upgrade(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx Install(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx Install(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx Install(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.8 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx Install(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.8 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 tensorflow Neuronx Install(2.7)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.7 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 tensorflow Neuronx Install(2.7)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.7 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2 Tensorflow DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=tensorflow --framework-version=2.10 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework
# U20 Tensorflow DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=tensorflow --framework-version=2.10 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework
# AL2 PyTorch Neuron DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=inf1 --ami=dlami-framework
# U20 PyTorch Neuron DLAMI
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=all --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=inf1 --ami=dlami-framework
# U22 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U22 Tensorflow Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U22 Pytorch Neuron Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami
# U22 Tensorflow Neuron Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=non-dlami
# AL2 Pytorch Neuronx DLAMI Upgrade(1.13)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework
# U20 Pytorch Neuronx DLAMI Upgrade(1.13)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework
# AL2 tensorflow Neuronx upgrade DLAMI(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# AL2 tensorflow Neuronx upgrade DLAMI(2.9)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# AL2 tensorflow Neuronx upgrade DLAMI(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# U20 tensorflow Neuronx upgrade DLAMI(2.10)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# U20 tensorflow Neuronx upgrade DLAMI(2.9)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.9.3 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# U20 tensorflow Neuronx upgrade(2.8)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=tensorflow --framework-version=2.8.4 --neuron-version=2.10.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework --category=compiler_framework
# U20 Pytorch Neuronx 2.0 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx 2.0 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U22 Pytorch Neuronx 2.0 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx Upgrade(2.0)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(2.0)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# U22 Pytorch Neuronx Upgrade(2.0)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2 Pytorch Neuronx DLAMI Upgrade(2.0)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2 --instance=trn1 --ami=dlami-framework
# U20 Pytorch Neuronx DLAMI Upgrade(2.0)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework
# AL2023 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# AL2023 tensorflow Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami --category=compiler_framework
# AL2023 Pytorch Neuronx 2.0 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# AL2023 tensorflow Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=tensorflow --framework-version=2.10.1.1.0.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami --category=compiler_framework
# U20 Pytorch Neuronx 2.1 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2023 Pytorch Neuronx 2.1 Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.5 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 Pytorch Neuronx Upgrade(2.1)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(2.1)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# U22 2.5.1 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.5.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 Pytorch Neuronx DLAMI Upgrade(2.1)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=dlami-framework
# U20 Pytorch Neuronx DLAMI Upgrade(2.1)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=dlami-framework
# U22 Neuron DLAMI - Torch-Neuronx-1.13.1
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=dlami-neuron
# U22 Neuron DLAMI - Torch-Neuronx-2.1.1
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=dlami-neuron
# U22 Neuron DLAMI - Tensorflow-Neuronx-2.10.1
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=dlami-neuron
# U22 Neuron DLAMI - Transformers-Neuronx
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=transformers-neuronx --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=dlami-neuron
# U22 Neuron DLAMI - Torch-Neuron-1.13.1
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=dlami-neuron
# U22 Neuron DLAMI - Tensorflow-Neuron-2.10.1
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=tensorflow --framework-version=2.10.1 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=inf1 --ami=dlami-neuron
# Rocky Linux 9 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=rockylinux9 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# AL2023 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=1.13.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# U22 2.1 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.1.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 2.1 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.1.2 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U20 2.1 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.1.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# U20 Pytorch Neuronx Upgrade(2.1)
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.1.2 --file=src/helperscripts/n2-manifest.json --os=ubuntu20 --instance=trn1 --ami=non-dlami
# AL2023 2.5.1 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.5.1 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# AL2023 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# U22 Driver and Tools
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools
# AL2023 EFA Installation
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=efa --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 EFA Installation
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=efa --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U22 2.6.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.6.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 2.6.0 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.6.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.6.0 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.6.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 2.7.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.7.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.7.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.7.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 2.7.0 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.7.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.7.0 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.7.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 2.8.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.8.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# AL2023 Latest Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=amazonlinux2023 --instance=trn1 --ami=non-dlami
# U22 2.8.0 Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U22 2.9.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U22 Latest Pytorch Neuronx Install
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu22 --instance=trn1 --ami=non-dlami
# U24 EFA Installation
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=efa --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami
# U24 2.9.0 Pytorch Neuronx Upgrade
.. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami
program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami # U24 Driver and Tools .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --framework=pytorch --framework-version=2.9.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami --category=driver_runtime_tools # U24 2.8.0 Pytorch Neuronx Upgrade .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=update --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami # U24 2.8.0 Pytorch Neuronx Install .. program-output:: python3 src/helperscripts/n2-helper.py --install-type=install --category=compiler_framework --framework=pytorch --framework-version=2.8.0 --file=src/helperscripts/n2-manifest.json --os=ubuntu24 --instance=trn1 --ami=non-dlami ================================================ FILE: src/helperscripts/n2-helper.py ================================================ import json import argparse from packaging.version import Version, parse import pandas as pd from pandas import json_normalize class manifest: def __init__(self, manifest_file): self.manifest_file = manifest_file self.df_packages = pd.DataFrame() def parse_manifest(self): with open(self.manifest_file, 'r') as f: manifest = json.load(f) # repos self.df_repos = json_normalize(manifest['repos_n2']) # latest release self.df_latest_release = json_normalize(manifest['latest_release']) # os properties self.df_os_properties = json_normalize(manifest['os_properties']) # ami properties self.df_ami_properties = json_normalize(manifest['ami_properties']) # dlami properties self.df_dlami_properties = json_normalize(manifest['dlami_properties']) # major version properties self.df_major_version_properties = json_normalize(manifest['major_version_properties']) # package properties self.df_package_properties = json_normalize(manifest['package_properties']) # neuron releases for release in manifest['neuron_releases']: df_release = json_normalize(release['packages']) df_release['neuron_version'] = release['neuron_version'] self.df_packages = pd.concat([self.df_packages, df_release]) # merge release packages self.df_release_packages = self.df_packages.merge(self.df_package_properties, how='left', on='name') self.df_release_packages['supported_instances'] = self.df_release_packages['supported_instances'].tolist() def merge_release_packages(self): self.df_release_packages = self.df_packages.merge(self.df_package_properties, how='left', on='name') def extract_major_minor_version(self, version): return str(version.major) + '.' 
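
    # Minimal usage sketch (illustrative only; the manifest path mirrors the
    # program-output directives above and assumes the repo root as the working
    # directory):
    #
    #   m = manifest('src/helperscripts/n2-manifest.json')
    #   m.parse_manifest()
    #   # m.df_release_packages now holds one row per (package, neuron_version)
    #   # pair, with the package_properties columns joined on 'name'.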

    def extract_major_minor_version(self, version):
        return str(version.major) + '.' + str(version.minor)

    def get_pip_packages_supporting_python_versions(self, args):
        '''
        Get supported python versions by packages (compiler and framework)
        e.g., {"3.6","3.7","3.8"}
        '''
        if args.neuron_version is None:
            neuron_version = self.get_latest_neuron_version_per_instance(args.instance)
        else:
            neuron_version = args.neuron_version

        df_instance = self.df_release_packages[
            (self.df_release_packages['supported_instances'].map(lambda x: args.instance in x)) &
            (self.df_release_packages['neuron_version'] == neuron_version)]

        # Compiler supporting Python versions
        compiler_python_versions = \
            df_instance.loc[df_instance['component'] == 'Compiler']['supported_python_versions'].values[0]

        # Specific framework version supporting Python versions
        df_framework = df_instance.loc[df_instance['category'] == args.framework].copy()
        df_framework['version'] = df_framework['version'].map(lambda x: Version(x))
        df_framework['major_minor_version'] = df_framework['version'].map(lambda x: str(x.major) + '.' + str(x.minor))
        framework_python_versions = df_framework.loc[
            df_framework['major_minor_version'] == self.extract_major_minor_version(Version(args.framework_version))][
            'supported_python_versions'].values[0]

        return list(set(compiler_python_versions) & set(framework_python_versions))

    def get_major_version(self, package_name, instance):
        return self.df_major_version_properties.loc[(self.df_major_version_properties['name'] == package_name)][
            instance].values[0]

    def generate_script(self, args):
        '''
        It generates:
        (1) str_preamble
        (2) str_driver
        (3) str_runtime
        (4) str_tools
        (5) str_python
        (6) str_compiler
        (7) str_framework
        '''
        str_preamble = ''

        # Install and enable EPEL (required only for rocky linux 9 currently)
        str_preamble += self.install_and_enable_epel(args)

        # Configure Neuron repository
        str_preamble += self.config_neuron_repository(args)

        # Update OS packages
        str_preamble += self.update_os_packages(args)

        # Install OS headers
        str_preamble += self.install_os_headers(args)

        # Install git
        str_preamble += self.install_git(args)

        # Install Neuron driver
        str_driver = self.install_neuron_driver(args)

        # Install Neuron runtime
        str_runtime = self.install_neuron_runtime(args)

        # Install EFA driver
        str_efa = self.install_efa_driver(args)

        # Install Neuron Tools
        str_tools = self.install_neuron_system_tools(args)

        # Add PATH
        if args.mode != 'compile' or args.ami != 'dlami-framework':
            str_tools += '\n# Add PATH\n'
            str_tools += 'export PATH=/opt/aws/neuron/bin:$PATH\n'

        # Install Python virtual environment
        str_python = self.set_python_venv(args)

        # Activate Python venv
        str_python += self.activate_python_venv(args)

        # Install Jupyter notebook
        str_python += self.jupyter_notebook(args)

        # Set pip repository
        str_python += self.set_pip_repository()

        # Install wget, awscli
        str_python += self.install_aux(args)

        # Install extra dependencies
        str_deps = self.install_extra_dependencies(args)

        # Install Neuron compiler
        str_compiler = self.install_neuron_compiler(args)

        # Install Neuron framework
        str_framework = self.install_neuron_framework(args)

        # Install Neuron compiler and framework
        str_compiler_framework = self.install_neuron_compiler_and_framework(args)
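
        # The snippet strings assembled above are combined below according to
        # --ami and --category: for example, --category=compiler_framework
        # returns only the dependency, venv, and compiler/framework snippets,
        # while --category=all also prepends the repository, driver, runtime,
        # and tools setup (plus the EFA installer on trn1).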
        if args.ami == 'dlami-framework':
            # dlami instructions
            str_dlami = self.install_dlami(args)
            return str_dlami
        elif args.ami == 'dlami-neuron':
            str_dlami = self.install_neuron_dlami(args)
            return str_dlami
        elif args.category == 'all':
            if args.instance == 'trn1':
                str_runtime += str_efa
            return str_preamble + str_driver + str_runtime + str_tools + str_deps + str_python + str_compiler_framework
        elif args.category == 'driver_runtime_tools':
            return str_preamble + str_driver + str_runtime + str_tools
        elif args.category == 'compiler_framework':
            return str_deps + str_python + str_compiler_framework
        elif args.category == 'driver':
            return str_preamble + str_driver
        elif args.category == 'runtime':
            return str_runtime
        elif args.category == 'tools':
            return str_tools
        elif args.category == 'compiler':
            if args.instance != 'inf1':
                return str_python + str_compiler
            else:
                return str_python
        elif args.category == 'framework':
            return str_framework
        elif args.category == 'efa':
            return str_efa

    def install_dlami(self, args):
        latest_release_for_instance = \
            self.df_latest_release.loc[self.df_latest_release['instance'] == args.instance]['version'].values[0]
        latest_release_for_dlami = self.df_dlami_properties[
            (self.df_dlami_properties['framework'] == args.framework) &
            (self.df_dlami_properties['supported_instances'].map(lambda x: args.instance in x))][
            'neuron_released_version'].values[0]

        if latest_release_for_instance == latest_release_for_dlami:
            return self.activate_python_venv(args)
        else:
            args.install_type = 'update'
            str_dlami = self.activate_python_venv(args)
            str_dlami += self.jupyter_notebook(args)
            str_dlami += self.set_pip_repository()
            str_dlami += self.install_neuron_compiler_and_framework(args)
            return str_dlami

    def install_neuron_dlami(self, args):
        str_dlami = ""
        if (args.instance == 'trn1' or args.instance == 'inf2') and args.category == "transformers-neuronx":
            str_dlami = '\n# Activate Python venv for Transformers-NeuronX \n'
            str_dlami += "source /opt/aws_neuronx_venv_transformers_neuronx/bin/activate"
        elif (args.instance == 'trn1' or args.instance == 'inf2') and args.framework == "pytorch" and args.framework_version == "1.13.1":
            str_dlami = '\n# Activate Python venv for Pytorch 1.13 \n'
            str_dlami += "source /opt/aws_neuronx_venv_pytorch_1_13/bin/activate"
        elif (args.instance == 'trn1' or args.instance == 'inf2') and args.framework == "pytorch" and args.framework_version == "2.1":
            str_dlami = '\n# Activate Python venv for Pytorch 2.1 \n'
            str_dlami += "source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate"
        elif (args.instance == 'trn1' or args.instance == 'inf2') and args.framework == "tensorflow" and args.framework_version == "2.10.1":
            str_dlami = '\n# Activate Python venv for Tensorflow 2.10 \n'
            str_dlami += "source /opt/aws_neuronx_venv_tensorflow_2_10/bin/activate"
        elif args.instance == 'inf1' and args.framework == "tensorflow" and args.framework_version == "2.10.1":
            str_dlami = '\n# Activate Python venv for Tensorflow 2.10 \n'
            str_dlami += "source /opt/aws_neuron_venv_tensorflow_2_10_inf1/bin/activate"
        elif args.instance == 'inf1' and args.framework == "pytorch" and args.framework_version == "1.13.1":
            str_dlami = '\n# Activate Python venv for Pytorch 1.13 \n'
            str_dlami += "source /opt/aws_neuron_venv_pytorch_1_13_inf1/bin/activate"
        return str_dlami
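
    # The /opt/aws_neuronx_venv_* and /opt/aws_neuron_venv_*_inf1 paths above
    # are the virtual environments that the Neuron DLAMIs ship preinstalled,
    # which is why install_neuron_dlami() only emits an activation line rather
    # than any package installation commands.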

    def jupyter_notebook(self, args):
        os_default_python_version = \
            self.df_os_properties.loc[self.df_os_properties['os'] == args.os]['default_python_version'].values[0]
        packages_supporting_python_versions = self.get_pip_packages_supporting_python_versions(args)
        if os_default_python_version in packages_supporting_python_versions:
            target_python_version = os_default_python_version
        else:
            target_python_version = max(packages_supporting_python_versions)

        framework_name = self.get_package_names(category=args.framework, instance=args.instance)[0]

        str_jupiter = '\n# Install Jupyter notebook kernel\n'
        str_jupiter += 'pip install ipykernel ' + '\n'
        str_jupiter += 'python' + target_python_version + ' -m ipykernel install --user --name '
        str_jupiter += 'aws_neuron_venv_' + args.framework
        if args.instance == 'inf1':
            str_jupiter += '_inf1'
        str_jupiter += ' --display-name "Python (' + framework_name + ')"' + '\n'
        str_jupiter += 'pip install jupyter notebook' + '\n'
        str_jupiter += 'pip install environment_kernels' + '\n'
        return str_jupiter

    def install_and_enable_epel(self, args):
        str = ''
        if args.mode != 'compile':
            if args.install_type == 'install':
                if args.os == 'rockylinux9':
                    str += '\n# Install and enable EPEL\n'
                    str += 'sudo dnf config-manager --set-enabled crb\n'
                    str += 'sudo dnf install epel-release -y\n'
        return str

    def config_neuron_repository(self, args):
        """
        Reads OS type from the arguments and generates scripts for configuration of Neuron repository
        """
        str = ''
        if args.mode != 'compile':  # Neuron repository is needed when mode is 'develop' or 'deploy'
            if args.install_type == 'install':
                str += '\n# Configure Linux for Neuron repository updates' + '\n'
                if args.os == 'ubuntu18' or args.os == 'ubuntu20' or args.os == 'ubuntu22' or args.os == 'ubuntu24':
                    str += '. /etc/os-release' + '\n'
                    str += 'sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <


================================================
FILE: src/helperscripts/n2-manifest.json
================================================
", "package_categories": ["driver","runtime","tools","compiler"]}
  ],
  "dlami_properties": [
    {"framework":"pytorch", "dlami": "1.13", "neuron_released_version": "2.17.0", "supported_instances":["trn1","inf2","inf1"]},
    {"framework":"tensorflow", "dlami": "2.10", "neuron_released_version": "2.17.0", "supported_instances":["trn1","inf2"]}
  ],
  "major_version_properties": [
    {"name":"neuronx-cc","inf1":"","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-k8-plugin","inf1":"2","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-k8-scheduler","inf1":"2","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-oci-hooks","inf1":"2","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"tensorflow-neuronx","inf1":"","trn1":"1","inf2":"1"},
    {"name":"torch-neuronx","inf1":"","trn1":"1","inf2":"1","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-dkms","inf1":"2.21","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-collectives","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-runtime-lib","inf1":"","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"aws-neuronx-tools","inf1":"2","trn1":"2","inf2":"2","trn2":"2","trn3":"2"},
    {"name":"tensorflow-model-server-neuronx","inf1":"2","trn1":"2","inf2":"2"},
    {"name":"neuronperf","inf1":"2","trn1":"2","inf2":"2"},
    {"name":"tensorboard-plugin-neuronx","inf1":"2","trn1":"2","inf2":"2","trn2":"2"},
    {"name":"nki","trn1":"2","inf2":"2","trn2":"2","trn3":"2"}
  ],
  "package_properties": [
    {"name":"aws-neuronx-runtime-discovery", "component":"General","category":"general","package_type":"pip","use_cases":["inference"],"pin_major":"false"},
    {"name":"aws_neuron_sdk_release_version", "component":"Github","category":"github","package_type":"pip","use_cases":["inference"],"pin_major":"false"},
    {"name":"libneuronxla","component":"Framework","category":"general","package_type":"pip","use_cases":["inference"],"pin_major":"false"},
    {"name":"neuron-cc","component":"Compiler","category":"compiler","package_type":"pip","use_cases":["inference"],"pin_major":"false"},
    {"name":"neuronx-cc","component":"Compiler","category":"compiler","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"},
{"name":"neuronx-cc-stubs","component":"Compiler","category":"compiler","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"}, {"name":"aws-neuronx-k8-plugin","component":"Kubernetes Plugin","category":"container","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"aws-neuronx-k8-scheduler","component":"Kubernetes Scheduler","category":"container","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"aws-neuronx-oci-hooks","component":"OCI Hooks","category":"container","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"mxnet-neuron","component":"MXNet","category":"mxnet","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"tensorflow-neuron","component":"TensorFlow","category":"tensorflow","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"tensorflow","component":"TensorFlow","category":"tensorflow","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"tensorflow-neuronx","component":"TensorFlow","category":"tensorflow","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"}, {"name":"torch-neuron","component":"PyTorch","category":"pytorch","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"torch-neuronx","component":"PyTorch","category":"pytorch","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"}, {"name":"transformers-neuronx","component":"Transformers Neuron","category":"transformers-neuronx","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"}, {"name":"mxnet_neuron","component":"MXNet","category":"mxnet","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"mx_neuron","component":"MXNet","category":"mxnet","package_type":"pip","use_cases":["inference"],"pin_major":"false"}, {"name":"aws-neuronx-dkms","component":"Driver","category":"driver","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"aws-neuronx-collectives","component":"Collective Communication Library","category":"runtime","package_type":"os","use_cases":["training"],"pin_major":"true"}, {"name":"efa-installer","component":"EFA","category":"efa","package_type":"na","use_cases":["training"],"pin_major":"false"}, {"name":"aws-neuronx-runtime-lib","component":"Runtime Library","category":"runtime","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"aws-neuron-tools","component":"System Tools","category":"system-tools","package_type":"os","use_cases":["inference"],"pin_major":"true"}, {"name":"aws-neuronx-tools","component":"System Tools","category":"system-tools","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"tensorflow-model-server-neuron","component":"TensorFlow Model Server","category":"model-server","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"tensorflow-model-server-neuronx","component":"TensorFlow Model Server","category":"model-server","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"neuronperf","component":"Perf Tools","category":"helper-tools","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"tensorboard-plugin-neuron","component":"TensorBoard","category":"profiling-tools","package_type":"os","use_cases":["inference"],"pin_major":"true"}, 
{"name":"tensorboard-plugin-neuronx","component":"TensorBoard","category":"profiling-tools","package_type":"os","use_cases":["inference","training"],"pin_major":"true"}, {"name":"libnrt.so","component":"Runtime Library","category":"libnrt","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"torch_xla","component":"PyTorch","category":"helper-lib","package_type":"pip","use_cases":["inference","training"],"pin_major":"false"}, {"name":"aws-neuronx-gpsimd-tools","component":"CustomOps Tools","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"aws-neuronx-gpsimd-customop-lib","component":"CustomOps","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"aws-neuronx-oci-hook","component":"OCI","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"dmlc_nnvm","component":"Compiler","category":"na","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"neuronx_hwm","component":"Compiler","category":"na","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"dmlc_topi","component":"Compiler","category":"na","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"dmlc_tvm","component":"Compiler","category":"na","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"inferentia_hwm","component":"Compiler","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"neuronx_distributed","component":"Neuron Distributed","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"neuronx_distributed_training","component":"Neuron Distributed Training","category":"na","package_type":"os","use_cases":["inference","training"],"pin_major":"false"}, {"name":"neuronx_distributed_inference","component":"Neuron Distributed Inference","category":"na","package_type":"os","use_cases":["inference"],"pin_major":"false"}, {"name":"jax_neuronx","component":"Jax","category":"jax","package_type":"pip","use_cases":["inference"],"pin_major":"true"}, {"name":"nki","component":"NKI","category":"nki","package_type":"pip","use_cases":["inference","training"],"pin_major":"true"} ], "neuron_releases": [ {"neuron_version":"2.29.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.31.24.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.27.4.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.21.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"14.09.x","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.29.147.0","supported_instances":["inf1","trn1","inf2","trn2","trn3"] ,"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.29.147.0","supported_instances":["inf1","trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.15.13.0","supported_instances":["inf1","trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.31.24.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-tools","version":"2.29.18.0","supported_instances":["inf1","trn1","inf2","trn2","trn3"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.7.0.1.0.8181","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"libneuronxla","version":"2.2.16408.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.24.5133.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx-cc-stubs","version":"2.24.5133.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed","version":"0.18.27753","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_inference","version":"0.9.17334","supported_instances":["inf2","trn2","trn1","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"nki","version":"0.3.0","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.918.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.9.0.2.13.24727","supported_instances":["trn1","inf2","trn2","trn3"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"efa-installer","version":"1.47","supported_instances":["trn1","trn2","trn3"],"supported_python_versions":[]} ]}, {"neuron_version":"2.28.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.30.59.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.26.10.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.20.7.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.20.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.29.71.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.29.71.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-oci-hook","version":"2.14.102.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.30.51.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.28.23.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.7.0.1.0.7584","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"libneuronxla","version":"2.2.15515.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.23.6484.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx-cc-stubs","version":"2.23.6484.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed","version":"0.17.26814","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_training","version":"1.7.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_inference","version":"0.8.16251","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"nki","version":"0.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.918.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.9.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuronx","version":"2.7.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, 
{"name":"efa-installer","version":"1.47","supported_instances":["trn1","trn2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.28.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.30.59.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.26.5.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.20.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.20.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.29.71.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.29.71.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.14.102.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.30.51.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.28.23.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.7.0.1.0.7584","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"libneuronxla","version":"2.2.15515.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.23.6484.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx-cc-stubs","version":"2.23.6484.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed","version":"0.17.26814","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_training","version":"1.7.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_inference","version":"0.8.16251","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"nki","version":"0.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.918.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.9.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuronx","version":"2.7.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.12.22436","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"efa-installer","version":"1.47","supported_instances":["trn1","trn2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.27.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.29.41.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.25.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.19.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.19.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.29.16.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.29.16.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.13.52.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.29.40.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.27.33.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.7.0.1.0.7377","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"libneuronxla","version":"2.2.14584.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.22.12471.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx-cc-stubs","version":"2.22.12471.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, 
{"name":"neuronx_distributed","version":"0.16.25997","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_training","version":"1.7.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_inference","version":"0.7.15063","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"nki","version":"0.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.918.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.9.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuronx","version":"2.7.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.27.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.29.41.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.25.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.19.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.19.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.29.16.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.29.16.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.13.52.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.29.40.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.27.33.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.7.0.1.0.7377","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, 
{"name":"libneuronxla","version":"2.2.14584.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.22.12471.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx-cc-stubs","version":"2.22.12471.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed","version":"0.16.25997","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_training","version":"1.7.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"neuronx_distributed_inference","version":"0.7.14366","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"nki","version":"0.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.918.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.9.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"torch-neuronx","version":"2.7.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.11.19912","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11","3.12"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.26.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.28.27.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.24.7.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.18.0.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-gpsimd-tools","version":"0.18.0.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.28.4.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.28.4.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.12.36.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.28.23.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.26.14.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.6.2.1.0.6446","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.12677.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.21.33363.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.21.33363.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.15.22404","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.6.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.6.10598","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.837.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, 
{"name":"torch-neuronx","version":"2.6.0.2.10.16998","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.7.0.2.10.16998","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.10.16998","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.1315","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.26.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.28.27.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.24.7.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.18.0.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.18.0.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.28.4.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.28.4.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.12.36.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.28.23.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.26.14.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.6.2.1.0.6446","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.12677.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.21.18209.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.21.18209.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.15.22404","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.6.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, 
{"name":"neuronx_distributed_inference","version":"0.6.10598","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.837.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.6.0.2.10.13553","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.7.0.2.10.13553","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.8.0.2.10.13553","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.1315","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.25.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.27.34.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.27.34.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.23.9.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.17.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.17.0.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.27.7.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.27.7.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.11.42.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.27.23.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.25.145.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.6.1.1.0.3499","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, 
{"name":"libneuronxla","version":"2.2.8201.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.20.9961.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.20.9961.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.14.18461","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.5.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.5.9230","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.813.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.6.0.2.9.9357","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.7.0.2.9.9357","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.1216","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.24.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.26.43.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.26.43.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.22.2.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.16.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-gpsimd-tools","version":"0.16.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.26.7.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.26.7.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.10.56.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.26.42.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.24.54.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.6.0.1.0.1296","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.4410.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.19.8089.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.19.8089.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.13.14393","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.4.1","supported_instances":["trn1","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.4.7422","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.760.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, 
{"name":"torch-neuronx","version":"2.5.1.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.6.0.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.7.0.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.985","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.24.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.26.43.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.26.43.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.22.2.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.16.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.16.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.26.7.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.26.7.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.10.56.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.26.42.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.24.54.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"jax_neuronx","version":"0.6.0.1.0.1296","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.4410.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.19.8089.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.19.8089.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, 
{"name":"neuronx_distributed","version":"0.13.14393","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.4.0","supported_instances":["trn1","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.4.7422","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.760.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.5.1.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.6.0.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.7.0.2.8.6734","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.985","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.23.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.25.65.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.21.37.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.15.12.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.15.1.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.25.24.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.25.24.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.9.88.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.25.57.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-tools","version":"2.23.9.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"jax_neuronx","version":"0.5.3.1.0.719","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.3493.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.18.121.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.18.121.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.12.12111","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.3.0","supported_instances":["trn1"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.3.5591","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.0.670.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.5.1.2.7.5413","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.6.0.2.7.5413","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, 
{"name":"transformers-neuronx","version":"0.13.798","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.22.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.24.59.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.24.59.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.20.28.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.14.12.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.14.6.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.24.23.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.24.23.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.7.5.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.24.53.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.24.53.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.22.61.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"jax_neuronx","version":"0.1.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.2.1630.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"libneuronxla","version":"0.5.3396","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"neuronx-cc","version":"2.17.194.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.17.194.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10"]}, 
{"name":"neuronx_distributed","version":"0.11.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.2.0","supported_instances":["inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.117.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.9","3.10"]}, {"name":"torch-neuronx","version":"2.5.1.2.6.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.470","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.21.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.23.135.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.19.64.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.13.16.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.13.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.23.45.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.23.45.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-oci-hook","version":"2.6.36.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.23.112.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.20.204.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"jax_neuronx","version":"0.1.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.1.714.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"libneuronxla","version":"0.5.3396","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.16.372.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.16.372.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.10.1","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.1.1","supported_instances":["trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.1.1","supported_instances":["inf2","trn2","trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.52.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuron","version":"2.8.4.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.17.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.1.2.2.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.5.1.2.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"1.13.1+torchneurong","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.380","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.21.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.23.133.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.19.64.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.13.16.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.13.2.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.23.30.0","supported_instances":["inf1","trn1","inf2","trn2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.23.30.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.6.36.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, 
{"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.23.110.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.20.204.0","supported_instances":["inf1","trn1","inf2","trn2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"jax_neuronx","version":"0.1.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.1.681.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"libneuronxla","version":"0.5.3388","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.16.345.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.16.345.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.10.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.1.0","supported_instances":["trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_inference","version":"0.1.0","supported_instances":["inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.52.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.17.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.1.2.2.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.5.1.2.4.0","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"1.13.1+torchneurong","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.6","supported_instances":["trn1","inf2","trn2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.13.322","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.20.2", "packages": [ {"name":"aws-neuronx-collectives","version":"2.22.33.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.18.20.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.12.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.12.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.22.20.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.22.20.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.5.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.22.19.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.19.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"jax_neuronx","version":"0.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.0.5347.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"libneuronxla","version":"0.5.3278","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.15.143.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.15.143.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.0.1","supported_instances":["trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.63.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"torch-neuron","version":"1.11.0.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuron","version":"1.9.1.2.11.13.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.3.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"1.13.1+torchneurong","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.5","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.12.313","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.20.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.22.26.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.18.12.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.12.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.12.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.22.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.22.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.5.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.22.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.19.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"jax_neuronx","version":"0.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.0.4986.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"libneuronxla","version":"0.5.2978","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.15.141.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.15.141.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.0.0","supported_instances":["trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.63.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"torch-neuronx","version":"1.13.1.1.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.1.2.2.3.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"1.13.1+torchneurong","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"2.1.4","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.12.313","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.20.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.22.26.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.18.12.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.12.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.12.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.22.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.22.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.5.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.22.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.19.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"jax_neuronx","version":"0.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.9"]}, {"name":"libneuronxla","version":"2.0.4115.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.2978","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.24.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.15.128.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx-cc-stubs","version":"2.15.128.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"neuronx_distributed_training","version":"1.0.0","supported_instances":["trn1"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.63.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.12.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.12.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.12.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.12.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.12.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.12.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch-neuronx","version":"2.1.2.2.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"torch_xla","version":"1.13.1+torchneurong","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, 
{"name":"torch_xla","version":"2.1.4","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"transformers-neuronx","version":"0.12.313","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10","3.11"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.19.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.21.46.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.17.17.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.11.4.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.11.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.21.14.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.21.14.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.4.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.21.41.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.2335","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.1795","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.23.5.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.14.227.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.63.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.15.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronf","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.11.351","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.19.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.21.46.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.17.17.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.11.4.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.11.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-k8-plugin","version":"2.21.14.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.21.14.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.4.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.21.41.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.2335","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.1795","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.147.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.23.5.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.93.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.14.213.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.63.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.11.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.11.4.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.10.12.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.15.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronf","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.11.351","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.18.2", "packages": [ {"name":"aws-neuronx-collectives","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.16.7.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.3.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.17.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"dmlc_tvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.965","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.971","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.50.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.22.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.55.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.13.72.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"torch-neuron","version":"1.13.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurone","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.10.0.360","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.18.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.16.7.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.3.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.17.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.965","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.971","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.50.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.22.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"neuronperf","version":"1.8.55.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.13.68.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurone","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"transformers-neuronx","version":"0.10.0.360","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.18.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.16.7.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.20.13.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.3.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.20.22.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.17.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.965","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.971","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.50.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.22.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.55.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.13.66.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.19.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.19.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.74.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.2.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurone","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.10.0.21","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.17.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.20.11.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.15.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.45.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.20.11.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.17.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.755","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.809","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.40.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.21.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.15.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.12.68.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.6.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.12.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuron","version":"2.7.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.13.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.1.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurond","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.9.474","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.16.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.19.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.15.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.45.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.19.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.16.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.498","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.669","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.40.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.21.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.15.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.12.68.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.6.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.12.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.1.2.0.0b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurond","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.9.474","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.16.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.19.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.15.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.9.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.9.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.19.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.45.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.19.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.16.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"dmlc_topi","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"2.0.498","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.669","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.40.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.21.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.15.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.12.54.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.6.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.12.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.8.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.8.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"torch-neuron","version":"1.11.0.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.17.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.1.1.2.0.0b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurond","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.1.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.9.474","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.15.2", "packages": [ {"name":"aws-neuronx-collectives","version":"2.18.19.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.14.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.8.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.8.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.27.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.18.15.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.15.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"1.0.680","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"libneuronxla","version":"0.5.570","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.25.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.20.3.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.11.0.35","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.11.0.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.43.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"torch-neuronx","version":"1.13.1.1.12.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronc","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.0.0+torchneuron0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.8.268","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.18.15","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.15.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.18.19.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.14.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.8.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.8.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.27.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.18.15.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.15.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"1.0.680","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.570","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.25.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.20.3.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"neuronx-cc","version":"2.11.0.34","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.11.0.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.43.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.12.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.1b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronc","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.0.0+torchneuron0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, 
{"name":"transformers-neuronx","version":"0.8.268","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.18.15","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.15.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.18.18.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.14.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.8.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.8.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-plugin","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.18.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.27.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"aws-neuronx-runtime-lib","version":"2.18.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.15.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_topi","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"dmlc_tvm","version":"1.18.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"inferentia_hwm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"1.0.663","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"libneuronxla","version":"0.5.538","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mx_neuron","version":"1.8.0.2.4.25.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.20.3.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronperf","version":"1.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx-cc","version":"2.11.0.34","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_distributed","version":"0.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"neuronx_hwm","version":"2.11.0.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.43.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.2.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.10.2.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.6.0","supported_instances":["inf1"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"1.13.1.1.12.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch-neuronx","version":"2.0.0.2.0.0b0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronc","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"torch_xla","version":"2.0.0+torchneuron0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.8.268","supported_instances":["trn1","inf2"],"supported_python_versions":["3.8","3.9","3.10"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.18.14","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.14.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.17.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-dkms","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.7.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.7.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.17.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.14.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.476","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.10.0.5","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuronb","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.10.0.35","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.17.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.17.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.22.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.11.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.7.84","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.15.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.4.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.14.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.17.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.7.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.7.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.17.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.14.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.476","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.10.0.5","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"torch_xla","version":"1.13.1+torchneuronb","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.10.0.34","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.19.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.17.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.17.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.22.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.11.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.7.84","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.17.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.15.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.4.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.13.2", "packages": [ {"name":"aws-neuronx-collectives","version":"2.16.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.12.18.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.6.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.16.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.440","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.9.0.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurona","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.9.0.40","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.18.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.16.18.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.16.18.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.25.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"tensorflow-neuron","version":"2.7.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.10.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.6.106","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, 
{"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.13.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.16.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.12.11.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.6.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.16.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.425","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.9.0.2","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurona","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.9.0.40","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.18.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.16.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.16.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.21.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.10.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.6.106","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.13.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.16.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.12.11.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.6.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.6.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.16.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.425","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.9.0.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneurona","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.9.0.16","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.18.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.16.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.16.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.21.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"2.10.1.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.10.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf."],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"torch-neuron","version":"1.11.0.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.9.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.10.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.6.106","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.12.2", "packages": [ {"name":"aws-neuronx-collectives","version":"2.15.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.11.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.5.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.5.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.15.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.413","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.8.0.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuron8","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.8.0.25","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"aws-neuronx-k8-plugin","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, 
{"name":"torch-neuronx","version":"1.13.1.1.9.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.5.58","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.14.4.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.12.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.15.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.11.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.5.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.5.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.15.14.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.413","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.8.0.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuron8","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.8.0.25","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, 
{"name":"tensorflow-neuron","version":"2.8.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.5.58","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, 
{"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.14.4.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.12.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.15.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.11.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.5.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.5.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.15.11.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.12.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.391","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.8.0.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7,","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuron8","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.8.0.25","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.17.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.15.6.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.16.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.9.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.9.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.39.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.5.58","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.7.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.14.4.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.2.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.11.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.14.9.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.10.11.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.4.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.4.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.14.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.11.10.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.326","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.7.0.3","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1+torchneuron7","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.7.0.40","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.16.2.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.14.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.14.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.8.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.8.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.8.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.8.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.8.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.8.9.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.37.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"torch-neuron","version":"1.10.2.2.7.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.7.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.7.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.7.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.7.10.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.9.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.8.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.4.60","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.6.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.16.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.16.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.16.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.14.2.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"islpy","version":"2021.1+aws2021.x.169.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_distributed","version":"0.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.10.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.13.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.9.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop-lib","version":"0.3.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.3.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.13.6.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.10.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.207","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.6.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch_xla","version":"1.13.1","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"neuronx-cc","version":"2.6.0.19","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"neuron-cc","version":"1.15.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"aws-neuronx-k8-plugin","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.13.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hook","version":"2.2.0.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.8.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.8.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.8.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.8.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.8.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.7.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.8.4.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-neuronx","version":"2.9.3.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.8.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.8.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.8.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.8.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.8.1.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.26.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.7.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.11.0.2.7.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.12.1.2.7.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.13.1.2.7.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"torch-neuron","version":"1.9.1.2.7.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, 
{"name":"mxnet_neuron","version":"1.5.1.1.10.39.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.4.1.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.1.1.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9","3.10"]}, {"name":"transformers-neuronx","version":"0.3.32","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.8.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.23.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_nnvm","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_topi","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"dmlc_tvm","version":"1.15.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"inferentia_hwm","version":"1.14.1","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"islpy","version":"2021.1","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]} ]}, {"neuron_version":"2.9.1", "packages": [ {"name":"aws-neuronx-collectives","version":"2.12.35.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.8.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop","version":"0.2.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.23.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.9.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.205","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.5.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch_xla","version":"1.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.5.0.28","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8"]}, {"name":"neuron-cc","version":"1.14.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.12.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.12.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.97.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.7.4.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.7.4.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, 
{"name":"tensorflow-neuron","version":"2.7.4.2.7.4.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.7.4.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.7.4.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.7.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.7.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.7.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.7.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.7.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.25.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.6.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.11.0.2.6.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.12.1.2.6.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.13.1.2.6.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.9.1.2.6.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.37.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.2.127.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.0.1.6.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.7.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.16.0","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.9.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.12.27.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.8.4.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop","version":"0.2.3.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.2.1.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.12.16.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-tools","version":"2.9.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.173","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.5.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch_xla","version":"1.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.5.0.28","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"neuron-cc","version":"1.14.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.12.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.12.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.97.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.7.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.7.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.7.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.7.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.7.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.2.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.7.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.7.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.7.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.7.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.7.3.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.25.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.10.2.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.11.0.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.12.1.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.13.1.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.9.1.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.37.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, 
{"name":"mx_neuron","version":"1.8.0.2.2.127.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.0.1.6.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.7.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.12.16.0","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.8.0", "packages": [ {"name":"aws-neuronx-collectives","version":"2.11.47.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.7.33.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-customop","version":"0.1.23.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-gpsimd-tools","version":"0.1.7.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-discovery","version":"2.9","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.11.43.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.8.2.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"libneuronxla","version":"0.5.144","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx_hwm","version":"2.4.0.1","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch_xla","version":"1.13.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"neuronx-cc","version":"2.4.0.21","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8"]}, {"name":"neuron-cc","version":"1.13.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.1.12.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.1.12.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.81.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.7.4.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.4.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.9.3.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.10.1.2.6.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.10.1.1.0.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.6.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, 
{"name":"tensorflow-model-server-neuronx","version":"2.7.4.2.6.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.4.2.6.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.9.3.2.6.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.10.1.2.6.5.0","supported_instances":["inf1","trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.19.0","supported_instances":["trn1","inf2"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuron","version":"2.4.6.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.11.0.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.12.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.10.2.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuron","version":"1.9.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.11.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mx_neuron","version":"1.8.0.2.2.43.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-neuronx","version":"1.13.0.1.5.0","supported_instances":["trn1","inf2"],"supported_python_versions":["3.7","3.8","3.9"]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.6.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.10.30.0","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.7.0", "packages": [ {"name":"neuronx-cc","version":"2.4.0.21","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.60.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-neuronx","version":"2.8.2.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.5.4.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.6.3.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.15.0","supported_instances":["trn1"],"supported_python_versions":[]}, 
{"name":"neuronx-gpsimd-customop","version":"0.1.23.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronx-gpsimd-tools","version":"0.1.7.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"torch-neuronx","version":"1.13.0.1.4.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-xla","version":"1.13.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.7.15.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.11.47.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.11.43.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.7.2.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.6.0", "packages": [ {"name":"neuronx-cc","version":"2.3.0.4","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.14.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-neuronx","version":"2.8.2.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.5.4.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.6.3.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuronx","version":"2.5.3.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"torch-neuronx","version":"1.12.0.1.4.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"torch-xla","version":"1.12.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-dkms","version":"2.6.33.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.10.37.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.10.30.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.6.1.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.5.0", "packages": [ {"name":"neuron-cc","version":"1.13.5.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, 
{"name":"neuronx-cc","version":"2.2.0.73","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.1.12.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.14.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.5.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.5.3.2.5.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.6.5.2.5.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.7.3.2.5.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.2.2.5.6.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.8.2.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-model-server-neuronx","version":"1.15.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.5.4.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.6.3.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.7.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuronx","version":"2.8.0.2.5.6.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuron","version":"2.4.6.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.11.0.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.12.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.10.2.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.7.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.8.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.9.1.2.5.8.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuronx","version":"1.11.0.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.11.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"mx_neuron","version":"1.8.0.2.2.43.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"aws-neuronx-dkms","version":"2.6.33.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.10.34.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, 
{"name":"aws-neuronx-runtime-lib","version":"2.10.27.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.5.19.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.6.1.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.10.27.0","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.4.0", "packages": [ {"name":"neuron-cc","version":"1.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"neuronx-cc","version":"2.2.0.73","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.1.2.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.1.2.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.1.2.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.5.3.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.6.3.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.7.1.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.0.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.8.2.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-model-server-neuron","version":"1.15.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.5.4.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.6.3.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.7.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.8.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuron","version":"2.4.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.7.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.8.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.9.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.10.2.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.11.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuronx","version":"1.11.0.1.2.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, 
{"name":"mx_neuron","version":"1.8.0.2.2.2.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"aws-neuronx-dkms","version":"2.6.5.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.10.17.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.10.15.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.5.16.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.2.51.0","supported_instances":["inf1"],"supported_python_versions":[]} ]}, {"neuron_version":"2.3.0", "packages": [ {"name":"neuron-cc","version":"1.11.7.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"neuronx-cc","version":"2.1.0.76","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"aws-neuronx-k8-plugin","version":"2.0.1.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-k8-scheduler","version":"2.0.1.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-oci-hooks","version":"2.0.1.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"tensorflow-neuron","version":"1.15.5.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"tensorflow-neuron","version":"2.5.3.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.6.3.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.7.1.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuron","version":"2.8.0.2.3.0","supported_instances":["inf1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-neuronx","version":"2.8.2.1.1.0","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"tensorflow-model-server-neuron","version":"1.15.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.5.4.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.6.3.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.7.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorflow-model-server-neuron","version":"2.8.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"tensorboard-plugin-neuron","version":"2.4.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"torch-neuron","version":"1.7.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.8.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.9.1.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, 
{"name":"torch-neuron","version":"1.10.2.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuron","version":"1.11.0.2.3.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"torch-neuronx","version":"1.11.0.1.1.1","supported_instances":["trn1"],"supported_python_versions":["3.7","3.8"]}, {"name":"mxnet_neuron","version":"1.5.1.1.10.0.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"mx_neuron","version":"1.8.0.2.2.2.0","supported_instances":["inf1"],"supported_python_versions":["3.7"]}, {"name":"aws-neuronx-dkms","version":"2.5.41.0","supported_instances":["inf1","trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-collectives","version":"2.9.86.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"efa-installer","version":"na","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuronx-runtime-lib","version":"2.9.64.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"aws-neuron-tools","version":"2.1.4.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"aws-neuronx-tools","version":"2.4.14.0","supported_instances":["trn1"],"supported_python_versions":[]}, {"name":"neuronperf","version":"1.3.0.0","supported_instances":["inf1"],"supported_python_versions":[]}, {"name":"libnrt.so","version":"2.2.51.0","supported_instances":["inf1"],"supported_python_versions":[]} ]} ] } ================================================ FILE: src/helperscripts/neuron-releases-manifest.json ================================================ { "repos": { "whl": "https://pip.repos.neuron.amazonaws.com/", "rpm": "https://yum.repos.neuron.amazonaws.com/", "deb": "https://apt.repos.neuron.amazonaws.com/" }, "manifest_date": "2022-12-12", "manifest_version": "1.0.1", "dlami_conda_env": { "tensorflow": { "1.15.5": [ "aws_neuron_tensorflow_p36", "aws_neuron_tensorflow_p36" ], "2.1.4": [ "None", "None" ], "2.2.3": [ "None", "None" ], "2.3.4": [ "None", "None" ], "2.4.3": [ "None", "None" ], "2.5.1": [ "None", "None" ], "2.5.2": [ "None", "None" ], "2.5.3": [ "None", "None" ], "2.6.3": [ "None", "None" ], "2.6.5": [ "None", "None" ], "2.7.1": [ "None", "None" ], "2.7.3": [ "None", "None" ], "2.8.0": [ "None", "None" ], "2.8.2": [ "None", "None" ] }, "pytorch": { "1.5.1": [ "None", "aws_neuron_pytorch_p36" ], "1.6.0": [ "None", "aws_neuron_pytorch_p36" ], "1.7.1": [ "None", "aws_neuron_pytorch_p36" ], "1.8.1": [ "aws_neuron_pytorch_p36", "aws_neuron_pytorch_p36" ], "1.9.1": [ "None", "aws_neuron_pytorch_p36" ], "1.10.1": [ "None", "aws_neuron_pytorch_p36" ], "1.10.2": [ "None", "None" ], "1.11.0": [ "None", "None" ] }, "mxnet": { "1.5.1": [ "aws_neuron_mxnet_p36", "aws_neuron_mxnet_p36" ], "1.8.0": [ "None", "aws_neuron_mxnet_p36" ] } }, "latest_version_of_maintained_packages": { "runtime-server": { "framework": false, "package-name": "aws-neuron-runtime", "package-version": "1.6.24.0", "neuron-version": "1.15.2" }, "mxnet-1.5.1": { "framework": true, "package-name": "mxnet_neuron", "package-version": "1.5.1.1.6.5.1", "neuron-version": "1.16.0" } }, "fal_supported_runtime": { "tensorflow": { "1.15.5": { "neuron-rtd": [ "0.0.0.0", "1.15.5.1.6.10.0" ], "libnrt": [ "1.15.5.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.1.4": { "neuron-rtd": [ "0.0.0.0", "2.1.4.1.6.10.0" ], "libnrt": [ "2.1.4.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.2.3": { "neuron-rtd": [ "0.0.0.0", "2.2.3.1.6.10.0" ], "libnrt": [ "2.2.3.2.0.0.0", 
"99.99.99.99.99.99.99" ] }, "2.3.3": { "neuron-rtd": [ "0.0.0.0", "99.99.99.99.99.99.99" ], "libnrt": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ] }, "2.3.4": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.3.4.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.4.2": { "neuron-rtd": [ "0.0.0.0", "99.99.99.99.99.99.99" ], "libnrt": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ] }, "2.4.3": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.4.3.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.5.0": { "neuron-rtd": [ "0.0.0.0", "99.99.99.99.99.99.99" ], "libnrt": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ] }, "2.5.1": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.5.2": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.5.3": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.6.3": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.6.5": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.7.1": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.7.3": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.8.0": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] }, "2.8.2": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "2.5.1.2.0.0.0", "99.99.99.99.99.99.99" ] } }, "pytorch": { "1.5.1": { "neuron-rtd": [ "0.0.0.0", "1.5.1.1.5.21.0" ], "libnrt": [ "1.5.1.1.5.21.1", "99.99.99.99.99.99.99" ] }, "1.7.1": { "neuron-rtd": [ "0.0.0.0", "1.7.1.1.5.21.0" ], "libnrt": [ "1.7.1.1.5.21.1", "99.99.99.99.99.99.99" ] }, "1.8.1": { "neuron-rtd": [ "0.0.0.0", "1.8.1.1.5.21.0" ], "libnrt": [ "1.8.1.1.5.21.1", "99.99.99.99.99.99.99" ] }, "1.9.1": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "1.9.1.0.0.0.0", "99.99.99.99.99.99.99" ] }, "1.10.1": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "1.9.1.0.0.0.0", "99.99.99.99.99.99.99" ] }, "1.10.2": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "1.9.1.0.0.0.0", "99.99.99.99.99.99.99" ] }, "1.11.0": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "1.9.1.0.0.0.0", "99.99.99.99.99.99.99" ] }, "1.12.1": { "neuron-rtd": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ], "libnrt": [ "1.9.1.0.0.0.0", "99.99.99.99.99.99.99" ] } }, "mxnet": { "1.5.1": { "neuron-rtd": [ "0.0.0.0", "99.99.99.99.99.99.99" ], "libnrt": [ "99.99.99.99.99.99.99", "99.99.99.99.99.99.99" ] }, "1.8.0": { "neuron-rtd": [ "0.0.0.0", "1.8.0.1.3.4.0" ], "libnrt": [ "1.8.0.1.3.4.1", "99.99.99.99.99.99.99" ] } } }, "latest_release": { "inf1": { "version": "2.8.0" } }, "neuron_versions": { "2.6.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.6.33.0": { "main_version": true, "pre_install_cmds": [], 
"post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.10.27.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuronx-tools": { "install_on_compute_instance": false, "versions": { "2.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.13.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.12.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.5.2.5.6.0": { 
"main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.3.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.2.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuronx": { "install_on_compute_instance": false, "versions": { "1.15.0.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.11.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.43.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "2.7.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.7.15.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.10.27.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { 
"aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuronx-tools": { "install_on_compute_instance": false, "versions": { "2.7.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.13.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.12.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.3.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.2.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuronx": { "install_on_compute_instance": false, "versions": { "1.15.0.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], 
"package_type": [ "deb", "rpm" ] }, "2.5.4.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.11.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.43.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "2.8.0": { "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.7.33.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.10.30.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuronx-tools": { "install_on_compute_instance": false, "versions": { "2.8.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.13.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ 
"tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.12.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.6.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.4.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.4.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.9.3.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.10.1.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuronx": { "install_on_compute_instance": false, "versions": { "1.15.0.2.6.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.4.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], 
"package_type": [ "deb", "rpm" ] }, "2.8.4.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.9.3.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.10.1.2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.11.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.43.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } }, "instance_support": [ "inf1" ], "python_ver": [ "3.7" ] }, "2.5.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.6.33.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.10.27.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.1.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuronx-tools": { "install_on_compute_instance": false, "versions": { "2.5.19.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.13.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": 
{ "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.5.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.12.1.2.5.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.5.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.3.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.2.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuronx": { "install_on_compute_instance": false, "versions": { "1.15.0.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.5.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.6.0": { "main_version": true, 
"pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.11.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.43.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "2.4.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.1.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.1.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.5.16.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.11.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.3.0.0": { "main_version": true, 
"pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "2.3.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { 
"framework": false, "packages": { "aws-neuronx-dkms": { "install_on_compute_instance": false, "versions": { "2.5.41.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuronx-k8-plugin": { "install_on_compute_instance": false, "versions": { "2.0.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuronx-k8-scheduler": { "install_on_compute_instance": false, "versions": { "2.0.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.1.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.11.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.3.0.0": { "main_version": 
false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.19.2": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.3.26.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.9.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { 
"install_on_compute_instance": false, "versions": { "1.9.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.1.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.11.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, 
"2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.19.1": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.3.11.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.9.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.9.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.1.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.11.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": 
{ "1.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.11.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.0.0": { "main_version": false, 
"pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.19.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.3.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.9.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.9.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.1.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.11.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.7.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.2.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, 
"1.11.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.3.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.8.0.2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.10.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.18.0": { "python_ver": [ "3.7" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.14.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { 
"framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.51.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.8.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.8.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.790.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.10.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.1.2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.3.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.6.3.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.7.1.2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, 
"tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.4.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.6.3.2.2.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.7.0.2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.3.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.9.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.2.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3.us-west-2.amazonaws.com/1.8.0/aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx-1.8.0.2-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.17.2": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.31.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.623.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { 
"install_on_compute_instance": true, "versions": { "1.9.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.2.2.1.14.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.1.14.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" 
], "package_type": [ "deb", "rpm" ] }, "2.5.3.2.1.14.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.8.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.1.5.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.17.1": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.31.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.623.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.9.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { 
"1.5.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.1.13.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.2.2.1.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.1.13.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.3.2.1.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": 
true, "versions": { "1.5.1.1.8.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.1.5.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.17.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.31.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.623.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.9.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ 
"bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.10.1.2.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.1.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.2.2.1.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.1.6.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.3.2.1.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.8.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.1.5.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.16.3": 
{ "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.18.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.494.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.8.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.0.85.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.0.536.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.0.536.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.0.536.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.0.536.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], 
"format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.1.2.0.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.2.2.0.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.7.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.0.290.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.16.2": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.18.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": 
false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.327.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.8.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.0.85.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.0.468.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.0.468.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.0.468.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.0.468.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.1.2.0.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.0.4.0": { "main_version": false, "pre_install_cmds": [], 
"post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.2.2.0.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.7.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.0.276.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.16.1": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.18.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.327.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", 
"neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.0.85.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.0.392.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.0.392.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.0.392.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.0.392.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.1.2.0.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] 
}, "2.5.2.2.0.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.7.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.2.0.276.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.16.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "libnrt": { "framework": false, "packages": { "libnrt": { "install_on_compute_instance": false, "versions": { "2.2.15.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "lib" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "2.0.277.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.7.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "neuronperf": { "framework": false, "packages": { "neuronperf": { "install_on_compute_instance": false, "versions": { "1.0.85.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.2.0.318.0": { "main_version": false, 
"pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.2.0.318.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.2.0.318.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.9.1.2.0.318.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.4.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.3.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.1.2.0.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.3.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.4.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.3.2.0.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.2.2.0.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.7.0.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { 
"install_on_compute_instance": true, "versions": { "1.8.0.2.0.271.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.15.2": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.1.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.6.24.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.22.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.22.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.6.21.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.7.25.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.6.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.5.21.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" 
], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.3.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.2.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.0.1.6.10.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.2.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.0.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.1.1.6.10.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.1.1.6.10.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.6.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.3.4.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.15.1": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.1.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": 
{ "install_on_compute_instance": false, "versions": { "1.6.24.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.22.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.22.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.6.21.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.7.25.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.6.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.5.21.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.3.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.2.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.0.1.6.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { 
"framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.2.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.3.0.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.1.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.1.1.6.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.6.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.3.4.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.15.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.0.450.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.6.19.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.17.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.17.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { 
"aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.6.16.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.7.20.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.6.13.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.5.21.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.5.21.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.1.4.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.2.3.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.3.3.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.4.2.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] }, "2.5.0.1.6.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.1.4.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.2.2.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], 
"package_type": [ "deb", "rpm" ] }, "2.3.0.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.4.1.1.6.8.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] }, "2.5.1.1.6.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.6.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.3.4.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.14.2": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "2.0.386.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.6.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.6.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.7.10.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.5.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, 
"packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.5.12.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.5.12.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.5.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.5.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.5.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.6.1.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.3.0.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.14.1": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.5.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.6.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" 
] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.7.4.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.5.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.5.12.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.5.12.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.5.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.5.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.5.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.6.1.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.3.0.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.14.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.5.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], 
"content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.5.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.6.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.6.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.5.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.6.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.4.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.4.1.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.4.1.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.8.1.1.4.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.1.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.4.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { 
"install_on_compute_instance": true, "versions": { "1.5.1.1.5.1.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.2.1.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.13.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.4.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.4.17.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.5.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.5.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.4.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.5.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.3.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.3.5.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.3.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.3.3.0": { "main_version": true, 
"pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-plugin-neuron": { "install_on_compute_instance": false, "versions": { "2.0.29.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.3.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet_neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.4.4.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } }, "mx_neuron": { "install_on_compute_instance": true, "versions": { "1.8.0.1.1.2.0": { "main_version": true, "pre_install_cmds": [ "wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl", "pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl" ], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.12.3": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.4.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.4.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.4.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, 
"versions": { "1.2.11.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.2.24.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.2.24.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.2.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.3.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.12.2": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.4.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.4.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { 
"aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.4.12.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.2.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.2.16.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.2.16.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.2.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.3.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.12.1": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.4.9.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.4.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ 
"deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.4.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.4.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.2.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.2.15.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.2.15.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.2.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.6.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.8.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.3.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.12.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.4.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.4.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.4.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], 
"format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.4.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.4.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.4.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.2.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.2.3.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.2.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.5.1.2.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.2.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.3.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.11.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { "aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.3.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.3.1.0": { 
"main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.3.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.3.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.3.2.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.3.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.1.7.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] }, "1.7.1.1.1.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.4.1.1.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.1.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.1.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.2.1.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } }, "1.10.0": { "python_ver": [ "3.6" ], "instance_support": [ "inf1" ], "arch": [ "x86_64" ], "components": { "driver": { "framework": false, "packages": { 
"aws-neuron-dkms": { "install_on_compute_instance": false, "versions": { "1.2.3.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin", "src" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-server": { "framework": false, "packages": { "aws-neuron-runtime": { "install_on_compute_instance": false, "versions": { "1.2.5.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-plugin": { "framework": false, "packages": { "aws-neuron-k8-plugin": { "install_on_compute_instance": false, "versions": { "1.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "k8-scheduler": { "framework": false, "packages": { "aws-neuron-k8-scheduler": { "install_on_compute_instance": false, "versions": { "1.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "runtime-base": { "framework": false, "packages": { "aws-neuron-runtime-base": { "install_on_compute_instance": false, "versions": { "1.2.0.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-rtd" ], "package_type": [ "deb", "rpm" ] } } } } }, "tools": { "framework": false, "packages": { "aws-neuron-tools": { "install_on_compute_instance": false, "versions": { "1.2.7.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "neuron-monitor", "neuron-cli", "neuron-top", "neuron-htop" ], "package_type": [ "deb", "rpm" ] } } } } }, "compiler": { "framework": false, "packages": { "neuron-cc": { "install_on_compute_instance": true, "versions": { "1.0.24045.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": "whl" } } } } }, "pytorch": { "framework": true, "packages": { "torch-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.0.1978.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "torch-neuron" ], "package_type": [ "whl" ] } } } } }, "tensorflow": { "framework": true, "packages": { "tensorflow-neuron": { "install_on_compute_instance": true, "versions": { "1.15.4.1.0.2168.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorboard": { "framework": false, "packages": { "tensorboard-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.0.615.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } }, "tensorflow-model-server": { "framework": false, "packages": { "tensorflow-model-server-neuron": { "install_on_compute_instance": false, "versions": { "1.15.0.1.0.2168.0": { "main_version": true, "pre_install_cmds": [], "post_install_cmds": [], "format": [ "bin" ], "content": [ "tbd" ], "package_type": [ "deb", "rpm" ] } } } } }, "mxnet": { "framework": true, "packages": { "mxnet-neuron": { "install_on_compute_instance": true, "versions": { "1.5.1.1.1.88.0": { "main_version": false, "pre_install_cmds": [], "post_install_cmds": [], "format": [ 
"bin" ], "content": [ "tbd" ], "package_type": [ "whl" ] } } } } } } } } } ================================================ FILE: src/helperscripts/neuron-setup-example.py ================================================ from neuronsetuphelper import neuron_setup_helper nr_setup=neuron_setup_helper(manifest_file='default',neuron_version='latest') setup_cmd = nr_setup.instructions(framework='tensorflow',action='Install',os='ubuntu',ami='non-dlami',mode='develop',framework_version='latest') print (setup_cmd) ================================================ FILE: src/helperscripts/neuronsetuphelper.py ================================================ import json import argparse from packaging.version import Version, parse ######################################## # neuron_setup_helper ######################################## class neuron_release_info: def __init__(self): self.release_frameworks_all = {} self.release_frameworks_main = {} self.release_packages_all ={} self.release_package_main={} self.release_frameworks_list=[] self.release_components_list = [] self.release_tf_package_to_model_server_package={} self.release_os_install_list =[] self.python_ver="" # release_frameworks_all # Desc: Dictionary - all framewors included in the release # example: 'pytorch-1.5.1': {'framework': 'pytorch', 'package': 'torch-neuron', 'version': '1.5.1.1.5.3.0', 'main': False, 'framework_version': '1.5.1', 'package_name': 'torch-neuron-1.5.1.1.5.3.0', 'pre_install_cmds': [], 'post_install_cmds': []} # release_frameworks_all = {} # release_frameworks_main # Desc: Dictionary - the main frameworks in each rlease (single version of the same framework) # example: 'mxnet': {'framework': 'mxnet-1.8.0', 'package': 'mx_neuron', 'version': '1.8.0.1.3.0.0', 'framework_version': '1.5.1', 'full_package_name': 'mx_neuron-1.8.0.1.3.0.0', 'pre_install_cmds': ['wget https://aws-mx-pypi.s3-us-west-2.amazonaws.com/1.8.0/aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl', 'pip install aws_mx_cu110-1.8.0-py2.py3-none-manylinux2014_x86_64.whl'], 'post_install_cmds': []} # release_frameworks_main = {} # release_packages_all # Desc: Dictionary - all packages included in the release # example: 'aws-neuron-dkms-1.5.0.0': {'component': 'driver', 'package': 'aws-neuron-dkms', 'version': '1.5.0.0', 'main': True, 'pre_install_cmds': [], 'post_install_cmds': []} # release_packages_all ={} # release_package_main # Desc: Dictionary - only single package from each component # example: 'driver': {'package': 'aws-neuron-dkms', 'version': '1.5.0.0', 'full_package_name': 'aws-neuron-dkms-1.5.0.0', 'pre_install_cmds': [], 'post_install_cmds': []} # release_package_main={} # list of all framewoks included in the specific neuron release # release_frameworks_list=[] # list of all neuron components included in the specific neuron release # release_components_list = [] # dictionary to correlate tf version with model server version # release_tf_package_to_model_server_package = {} # list of all Neuron versions included in the manifest neuron_ver_list = [] # release_os_install_list =[] dlami_conda_env= {} package_formal_name= { "compiler":"Neuron Compiler", "tensorflow":"Neuron TensorFlow", "pytorch":"Neuron PyTorch", "mxnet":"Neuron MXNet", "runtime-server":"Neuron Runtime server", "libnrt":"Neuron Runtime library", "runtime-base":"Neuron Runtime base", "driver":"Neuron Driver", "tools":"Neuron Tools", "tensorboard":"Neuron TensorBoard", "tensorflow-model-server":"Neuron TensorFlow model server" } ######################################## # 
parse_arguments ######################################## def cli_parse_arguments(): __name__='neuron-install-helper.py' parser = argparse.ArgumentParser(prog=__name__ ,usage='\npython3 %(prog)s --list {neuron_versions,packages,components,frameworks} [--neuron-version=X.Y.Z] [--file FILE] \n' +'python3 %(prog)s --install {pytorch,tensorflow,mxnet} [--neuron-version=X.Y.Z] [--framework-version=FRAMEWORK-X.Y.Z] [options]\n' +'python3 %(prog)s --install {driver,runtime,tools} [--neuron-version=X.Y.Z] [options]\n' +'python3 %(prog)s --update {pytorch,tensorflow,mxnet} [--framework-version=framework-X.Y.Z] [options]\n' +'python3 %(prog)s --update {driver,runtime,tools} [options]\n' +'options= [--file FILE] [--ami {dlami,non-dlami}] [--os {ubuntu,amazonlinux}]\n' ,description='Installer helper for Neuron SDK') group = parser.add_mutually_exclusive_group(required=True) parser.add_argument("--neuron-version",metavar='X.Y.Z') group.add_argument("--list",choices=['neuron_versions','packages','components','frameworks']) group.add_argument("--install",choices=['pytorch','tensorflow','mxnet']) group.add_argument("--update",choices=['pytorch','tensorflow','mxnet']) parser.add_argument("--mode",choices=['develop','compile','deploy'],default='develop') parser.add_argument("--framework-version",metavar='framework-X.Y.Z') parser.add_argument("--os",choices=['ubuntu','amazonlinux'],default='ubuntu',help='default=ubuntu') parser.add_argument("--ami",choices=['dlami','non-dlami'],default='non-dlami',help='default=non-dlami') parser.add_argument("--file",default='neuron-releases-manifest.json',help='default=neuron-releases-manifest.json') return parser.parse_args() def enumerate_release_manifest(nr_setup, in_neuron_version): ######################################## # Enumerate the Json file ######################################## if nr_setup.file==None: nr_setup.file='neuron-releases-manifest.json' try: read_file = open(nr_setup.file, "r") except: print(__name__,": error:","Can't open " + nr_setup.file + " ") exit(-1) neuron_releases = json.load (read_file) latest_neuron_version = neuron_releases["latest_release"]["inf1"]["version"] nr_setup.dlami_conda_env = neuron_releases["dlami_conda_env"] nr_setup.fal_supported_runtime = neuron_releases["fal_supported_runtime"] if (in_neuron_version == None) | (in_neuron_version == 'latest'): neuron_version=latest_neuron_version else: neuron_version = in_neuron_version for n_ver in neuron_releases["neuron_versions"]: neuron_ver_list.append(n_ver) for neuron_release_ver in neuron_releases["neuron_versions"]: m_release=neuron_releases["neuron_versions"][neuron_release_ver]["components"] n_info=neuron_release_info() n_info.python_ver= neuron_releases["neuron_versions"][neuron_release_ver]["python_ver"][0] for component_name in m_release: if m_release[component_name]["framework"]==False: n_info.release_components_list.append(component_name) m_packages=m_release[component_name]["packages"] for package_name in m_packages: for package_ver in m_packages[package_name]["versions"]: m_package_ver=m_packages[package_name]["versions"][package_ver] full_package_name=package_name+'-'+package_ver n_info.release_packages_all[full_package_name]= {"component":component_name,"package":package_name,"version":package_ver,"main":m_package_ver["main_version"],"pre_install_cmds":m_package_ver["pre_install_cmds"],"post_install_cmds":m_package_ver["post_install_cmds"],"package_type":m_package_ver["package_type"]} if m_package_ver["main_version"]: 
n_info.release_package_main[component_name]={"package":package_name,"version":package_ver,"full_package_name":full_package_name,"pre_install_cmds":m_package_ver["pre_install_cmds"],"post_install_cmds":m_package_ver["post_install_cmds"],"package_type":m_package_ver["package_type"]} if m_release[component_name]["framework"]: ver_digits = package_ver.rsplit('.') fw_ver=ver_digits[0]+'.'+ver_digits[1]+'.'+ver_digits[2] fw_name_ver=component_name+'-'+fw_ver if m_release[component_name]["framework"]: n_info.release_components_list.append(fw_name_ver) n_info.release_frameworks_list.append(fw_name_ver) if m_package_ver["main_version"]: n_info.release_frameworks_main[component_name]={"framework":fw_name_ver,"package":package_name,"version":package_ver,"framework_version":fw_ver,"package_name":full_package_name,"full_package_name":full_package_name,"pre_install_cmds":m_package_ver["pre_install_cmds"],"post_install_cmds":m_package_ver["post_install_cmds"],"package_type":m_package_ver["package_type"]} n_info.release_frameworks_all[fw_name_ver]={"framework":component_name,"package":package_name,"version":package_ver,"main":m_package_ver["main_version"],"framework_version":fw_ver,"package_name":full_package_name,"pre_install_cmds":m_package_ver["pre_install_cmds"],"post_install_cmds":m_package_ver["post_install_cmds"],"package_type":m_package_ver["package_type"]} if 'driver' in n_info.release_components_list: n_info.release_os_install_list.append('driver') if 'runtime-server' in n_info.release_components_list: n_info.release_os_install_list.append('runtime-server') if 'tools' in n_info.release_components_list: n_info.release_os_install_list.append('tools') if 'tensorflow-model-server' in n_info.release_components_list: n_info.release_os_install_list.append('tensorflow-model-server') # correlate TF and TF model server versions for pkg in n_info.release_packages_all.keys(): if n_info.release_packages_all[pkg]['component'] == 'tensorflow': package_ver=n_info.release_packages_all[pkg]['version'] ver_digits = package_ver.rsplit('.') tf_small_ver=ver_digits[0]+'.'+ver_digits[1] for pkg2 in n_info.release_packages_all.keys(): if n_info.release_packages_all[pkg2]['component'] == 'tensorflow-model-server': package_ver=n_info.release_packages_all[pkg2]['version'] ver_digits = package_ver.rsplit('.') tf_model_server_small_ver=ver_digits[0]+'.'+ver_digits[1] if tf_model_server_small_ver==tf_small_ver: n_info.release_tf_package_to_model_server_package[pkg]=pkg2 break nr_setup.releases_info[neuron_release_ver]=n_info try: m_release=neuron_releases["neuron_versions"][neuron_version]["components"] except: print(__name__,": error: ","Version " + neuron_version + " is not a Neuron version or it is not supported") exit(-1) return (neuron_version,latest_neuron_version) ################ # Sanity Checks ################ def cli_validate(update,neuron_version,framework_version,is_latest_neuron,ami): # --update_cmd Sanity check # When choosing update, it always updates to latest, so neuron_version should not be provided if (update!=None) & (is_latest_neuron == False): print (__name__,": error: ","--update always updates to the latest Neuron version, can't specify a Neuron version") exit(-1) #if neuron_version != None: # if ami == 'dlami': # print (__name__,": error: ","--neuron_version should not be specified together with --ami=dlami") # exit(-1) if (framework_version != None): if (framework_version not in nr_setup.releases_info[neuron_version].release_frameworks_list): print (__name__,": error: "," " + framework_version + " is not a
supported framework") exit(-1) ######################################## # version to tuple ######################################## def versiontuple(v): filled = [] for point in v.split("."): filled.append(point.zfill(8)) return tuple(filled) ######################################## # --list command ######################################## def cli_list_cmd(nr_setup, neuron_version, list): str ='' if (list == 'neuron_versions'): str += '\nList of Neuron release versions supported by this helper:\n' + '\n' for ver in neuron_ver_list: str += 'neuron-'+ver + '\n' #TODO: add "[main]" label to main packages if (list == 'packages'): str += '\nList of Neuron packages included in Neuron release version ' + neuron_version + ':\n' + '\n' for package in nr_setup.releases_info[neuron_version].release_packages_all: if len( nr_setup.releases_info[neuron_version].release_packages_all[package]['package_type']): #FIXME Runtime library hardcode print if (nr_setup.releases_info[neuron_version].release_packages_all[package]["component"] == 'libnrt'): str += nr_setup.releases_info[neuron_version].release_packages_all[package]["component"] +' : \t' + \ "libnrt.so (version "+ \ nr_setup.releases_info[neuron_version].release_packages_all[package]["version"] + ")" + '\n' else: str += nr_setup.releases_info[neuron_version].release_packages_all[package]["component"] +' : \t' + package + '\n' if (list == 'components'): str += '\nList of Neuron components included in Neuron release version ' + neuron_version + ':\n' + '\n' for comp in nr_setup.releases_info[neuron_version].release_components_list: str += comp + '\n' #TODO: add "[main]" label to main frameworks if (list == 'frameworks'): str += '\nList of frameworks included in Neuron release version ' + neuron_version + ':\n' + '\n' for fw in nr_setup.releases_info[neuron_version].release_frameworks_all: str += nr_setup.releases_info[neuron_version].release_frameworks_all[fw]["framework"] +' : \t' + fw + '\n' return str ######################################## # Print configuration ######################################## def hlpr_print_config(nr_setup, neuron_version): str = '' str += '\n' str += '###########################################################################' + '\n' str += '# ' + nr_setup.action + ' ' + nr_setup.framework + ' ' if (nr_setup.framework_version != 'latest') & (nr_setup.framework_version != None): str += '(' + nr_setup.framework_version + ')' + ' ' if nr_setup.action == 'Update': str += 'from latest Neuron version ' + neuron_version else: str += 'from Neuron version ' + neuron_version str += '\n# ' str += 'On ' if (nr_setup.os == 'ubuntu'): str += 'Ubuntu ' elif (nr_setup.os == 'amazonlinux'): str += 'Amazon Linux ' if (nr_setup.ami == 'dlami'): str += 'DLAMI' else: str += 'AMI' str += ' for ' if (nr_setup.mode == 'compile'): str += 'compilation on compute instance' elif (nr_setup.mode == 'develop'): str += 'development on inf1 instance' elif (nr_setup.mode == 'deploy'): str += 'deployment on inf1 instance' str += '\n' str += '###########################################################################' + '\n' str += '\n' return str ################################### # Build Pip command ################################### def hlpr_build_pip_command(nr_setup, neuron_version, component,include_compiler,optional): package_dict= nr_setup.releases_info[neuron_version].release_package_main if (nr_setup.framework_version==None): fw_package_dict= nr_setup.releases_info[neuron_version].release_frameworks_main fw_comp=component else: fw_package_dict= 
nr_setup.releases_info[neuron_version].release_frameworks_all fw_comp=nr_setup.framework_version pip_cmd_prefix='' pip_cmd ='' if nr_setup.action=='Install': pip_cmd_prefix = 'pip install ' else: pip_cmd_prefix = 'pip install --upgrade ' cmd=pip_cmd_prefix if (component == 'mxnet') | (component == 'pytorch') | (component == 'tensorflow'): # Framework installation if (component == 'mxnet') | (component == 'pytorch'): pip_cmd += cmd + fw_package_dict[fw_comp]['package'] if (nr_setup.is_latest_neuron==False) | (nr_setup.force_versions == True): pip_cmd += '=='+fw_package_dict[fw_comp]['version'] elif (nr_setup.is_latest_neuron==True)&(nr_setup.framework_version!=None): pip_cmd += '=='+fw_package_dict[fw_comp]['framework_version']+'.*' elif (component == 'tensorflow'): if (parse(neuron_version)>=parse('2.99.99')): os_cmd += '\n' os_cmd += '################################################################################################################\n' os_cmd += '# To install or update to Neuron versions 2.99.99 and newer from previous releases:'+ '\n' if (nr_setup.os=='ubuntu'): os_cmd += '# - Uninstall aws-neuron-dkms by calling \`sudo apt-get remove aws-neuron-dkms -y\`'+ '\n' elif (nr_setup.os=='amazonlinux'): os_cmd += '# - Uninstall aws-neuron-dkms by calling \`sudo dnf remove aws-neuron-dkms -y\`'+ '\n' os_cmd += '# - DO NOT skip \'aws-neuronx-dkms\' install or upgrade step, you MUST install or upgrade to latest Neuron driver'+ '\n' os_cmd += '################################################################################################################\n' elif (parse(neuron_version)>=parse('1.19.1')): os_cmd += '\n' os_cmd += '################################################################################################################\n' os_cmd += '# To install or update to Neuron versions 1.19.1 and newer from previous releases:'+ '\n' os_cmd += '# - DO NOT skip \'aws-neuron-dkms\' install or upgrade step, you MUST install or upgrade to latest Neuron driver'+ '\n' os_cmd += '################################################################################################################\n' # Update header files if driver should be installed or updated if (comp=='driver'): os_cmd += hlpr_os_headers_update(nr_setup) if nr_setup.os=='ubuntu': os_cmd_prefix = 'sudo apt-get install ' elif (nr_setup.action=='Install')&(nr_setup.os=='amazonlinux'): os_cmd_prefix = 'sudo dnf install ' elif (nr_setup.action=='Update')&(nr_setup.os=='amazonlinux'): os_cmd_prefix = 'sudo dnf update ' if comp in nr_setup.releases_info[neuron_version].release_os_install_list: # install only if there is a package associated with the component if (len(pkg_dict[key]['package_type']) != 0): #os_cmd = build_os_command(cmd=os_cmd_prefix,component=comp,is_latest_release=is_latest_neuron) os_cmd += '\n' if (optional==False): os_cmd += '# ' + nr_setup.action + ' ' + package_formal_name[comp] else: os_cmd += '# Optional: ' + nr_setup.action + ' ' + package_formal_name[comp] if (nr_setup.is_latest_neuron==False)&(nr_setup.os=='ubuntu'): os_cmd += '\n' os_cmd += '# If you are downgrading from a newer version, please add the \'--allow-downgrades\' option to \'sudo apt-get install\' ' if (nr_setup.is_latest_neuron==False)&(nr_setup.os=='amazonlinux'): os_cmd += '\n' os_cmd += '# If you are downgrading from a newer version, please remove the existing package using \'sudo dnf remove\' before installing the older package' os_cmd += '\n' # Amazon Linux DLAMI will not allow updating tensorflow-model-server and aws-neuron-dkms without adding
sudo dnf versionlock delete if ((comp=='tensorflow-model-server') | (comp=='driver')) & (nr_setup.ami == 'dlami') & (nr_setup.os == 'amazonlinux'): os_cmd += 'sudo dnf versionlock delete ' os_cmd += pkg_dict[key]['package'] os_cmd += '\n' os_cmd += os_cmd_prefix + pkg_dict[key]['package'] # Amazon Linux (yum/dnf) package versions are pinned with a hyphen, not an equals sign version_key = "=" if (nr_setup.os=='amazonlinux'): version_key = "-" if (nr_setup.is_latest_neuron==False) | (nr_setup.force_versions): os_cmd += version_key + pkg_dict[key]['version'] elif (pkg!=None): if ( nr_setup.releases_info[neuron_version].release_package_main[comp]['version']!= nr_setup.releases_info[neuron_version].release_packages_all[pkg]['version']): os_cmd += version_key + pkg_dict[key]['version'] # Ubuntu DLAMI will not allow updating tensorflow-model-server and aws-neuron-dkms without adding --allow-change-held-packages if ((comp=='tensorflow-model-server') | (comp=='driver')) & (nr_setup.ami == 'dlami') & (nr_setup.os == 'ubuntu'): os_cmd += ' --allow-change-held-packages' os_cmd += ' -y' os_cmd += '\n' # Update header files if driver should be installed or updated if (comp=='driver'): os_cmd += '\n' os_cmd += '####################################################################################\n' os_cmd += '# Warning: If the Linux kernel is updated as a result of the OS package update'+ '\n' if (parse(neuron_version)>=parse('2.99.99')): os_cmd += '# the Neuron driver (aws-neuronx-dkms) should be re-installed after reboot'+ '\n' else: os_cmd += '# the Neuron driver (aws-neuron-dkms) should be re-installed after reboot'+ '\n' os_cmd += '####################################################################################\n' if (comp=='tools'): if (parse(neuron_version)>=parse('2.99.99')): os_cmd += '\n' os_cmd += '################################################################################################################\n' os_cmd += '# To install or update to Neuron versions 2.99.99 and newer from previous releases:'+ '\n' if (nr_setup.os=='ubuntu'): os_cmd += '# - Uninstall aws-neuron-tools by calling \`sudo apt-get remove aws-neuron-tools -y\`'+ '\n' elif (nr_setup.os=='amazonlinux'): os_cmd += '# - Uninstall aws-neuron-tools by calling \`sudo dnf remove aws-neuron-tools -y\`'+ '\n' os_cmd += '################################################################################################################\n' return os_cmd ######################################## ## Installation / Update instructions ######################################## def hlpr_instructions(nr_setup, neuron_version): cmd_string = '' setup_mode=nr_setup.mode # look for a conda environment for this framework version for fw_env in nr_setup.dlami_conda_env: if fw_env != nr_setup.framework: continue fw_ver_conda_env=nr_setup.dlami_conda_env[fw_env] for conda_env_fw_ver in fw_ver_conda_env: if (conda_env_fw_ver == nr_setup.fw_package_dict[nr_setup.fw_comp]['framework_version']): nr_setup.conda_env=nr_setup.dlami_conda_env[fw_env][conda_env_fw_ver][0] nr_setup.generic_conda_env=nr_setup.dlami_conda_env[fw_env][conda_env_fw_ver][1] break # look up which runtime works with this framework version fal_rtd=False fal_libnrt=False for fw in nr_setup.fal_supported_runtime: if fw != nr_setup.framework: continue if fw == nr_setup.framework: if (nr_setup.framework_version == None): fw_ver= nr_setup.releases_info[neuron_version].release_frameworks_main[nr_setup.framework]['framework_version'] fal_version=
nr_setup.releases_info[neuron_version].release_frameworks_main[nr_setup.framework]['version'] else: fw_ver= nr_setup.releases_info[neuron_version].release_frameworks_all[nr_setup.framework_version]['framework_version'] fal_version= nr_setup.releases_info[neuron_version].release_frameworks_all[nr_setup.framework_version]['version'] fal_supported_rtd=nr_setup.fal_supported_runtime[fw][fw_ver]['neuron-rtd'] fal_supported_libnrt=nr_setup.fal_supported_runtime[fw][fw_ver]['libnrt'] if (parse(fal_version) >= parse(fal_supported_rtd[0])) & \ (parse(fal_version) <= parse(fal_supported_rtd[1])): fal_rtd=True elif (parse(fal_version) >= parse(fal_supported_libnrt[0])) & \ (parse(fal_version) <= parse(fal_supported_libnrt[1])): fal_libnrt=True if nr_setup.conda_env == "None": dlami_ev_exists=False else: dlami_ev_exists=True #cmd_string += hlpr_print_config(nr_setup, neuron_version) if (nr_setup.framework_version==None): fw_package_dict= nr_setup.releases_info[neuron_version].release_frameworks_main fw_comp=nr_setup.framework else: fw_package_dict= nr_setup.releases_info[neuron_version].release_frameworks_all fw_comp=nr_setup.framework_version if (nr_setup.framework !=None): #if install or update # If we are not using DLAMI if (nr_setup.ami=='non-dlami') | \ ((nr_setup.ami=='dlami') & \ ( (nr_setup.action == 'Update') | \ (dlami_ev_exists==False) | \ (nr_setup.is_latest_neuron==False)) \ ): if (nr_setup.ami=='dlami') & (dlami_ev_exists==False): cmd_string += '\n' cmd_string += '# Note: There is no DLAMI Conda environment for this framework version'+ '\n' cmd_string += '# Framework will be installed/updated inside a Python environment'+ '\n' if (setup_mode == 'develop') | (setup_mode == 'deploy'): if (nr_setup.action =='Install')&(nr_setup.ami!='dlami'): # For first install, set up the Neuron OS packages repo (dnf or apt) cmd_string += hlpr_os_packages_first_setup(nr_setup) # Always update to latest OS packages cmd_string += hlpr_os_packages_update(nr_setup) cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version, comp='driver',optional=False,pkg=None) #FIXME Temporary check for MXNET 1.5 in maintenance mode if (neuron_version == "1.16.0") & (nr_setup.framework=="mxnet")& \ (fw_package_dict[fw_comp]['framework_version']=="1.5.1"): cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version="1.15.2", comp='runtime-server',optional=False,pkg=None) elif (fal_rtd): cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version, comp='runtime-server',optional=False,pkg=None) #if mode = develop, install tools if (setup_mode == 'develop'): cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version, comp='tools',optional=False,pkg=None) if (nr_setup.framework == 'tensorflow'): cmd_string += hlpr_build_pip_command(nr_setup, neuron_version, component='tensorboard',include_compiler=False,optional=False) if (nr_setup.action =='Install'): cmd_string += hlpr_os_export_path(nr_setup) if (nr_setup.ami=='non-dlami') | \ ((nr_setup.ami=='dlami')&(nr_setup.generic_conda_env=="None")): if (nr_setup.action =='Install'): # For first install, install python venv and activate a venv cmd_string += hlpr_pip_install_create_python_venv(nr_setup, neuron_version) elif (nr_setup.action =='Update'): # For subsequent updates, activate the venv used for the initial install cmd_string += hlpr_pip_activate_python_venv(nr_setup, neuron_version) elif (nr_setup.ami=='dlami'): cmd_string += hlpr_framework_dlami_activate(nr_setup) # Set up the Neuron pip package repos cmd_string += hlpr_pip_repos_setup() # Now install framework if (setup_mode == 'deploy'):
# do not install compiler when deploying cmd_string += hlpr_framework_compiler_setup(nr_setup, neuron_version, include_compiler=False) else: # install compiler when mode = develop or mode = compile cmd_string += hlpr_framework_compiler_setup(nr_setup, neuron_version, include_compiler=True) #if mode != compile, install model server if (setup_mode != 'compile'): if (nr_setup.framework == 'tensorflow'): if (nr_setup.framework_version==None): tf_package= nr_setup.releases_info[neuron_version].release_frameworks_main[nr_setup.framework]['package_name'] else: tf_package= nr_setup.releases_info[neuron_version].release_frameworks_all[nr_setup.framework_version]['package_name'] cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version, comp='tensorflow-model-server',optional=True,pkg= nr_setup.releases_info[neuron_version].release_tf_package_to_model_server_package[tf_package]) # if running DLAMI elif (nr_setup.ami=='dlami'): if (nr_setup.action =='Install'): cmd_string += '\n' cmd_string += '# Neuron is pre-installed on the Deep Learning AMI (DLAMI), the latest DLAMI version may not include the latest Neuron version '+ '\n' cmd_string += '# To update to the latest Neuron version, follow the "Update to latest release" instructions in the Neuron documentation'+ '\n' # WARNING: Exception # Starting with Neuron 1.16.0, a new kernel driver is needed to work with Runtime 2.x (library mode) if (parse(neuron_version)>=parse('1.16.0')): if (setup_mode == 'develop') | (setup_mode == 'deploy'): cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version, comp='driver',optional=False,pkg=None) #FIXME Temporary check for MXNET 1.5 in maintenance mode if (neuron_version == "1.16.0") & (nr_setup.framework=="mxnet")& \ (fw_package_dict[fw_comp]['framework_version']=="1.5.1"): cmd_string += hlpr_os_comp_setup_cmd(nr_setup, neuron_version="1.15.2", comp='runtime-server',optional=False,pkg=None) cmd_string += '\n' cmd_string += hlpr_framework_dlami_activate(nr_setup) return cmd_string ######################################## # neuron_setup_helper ######################################## class neuron_setup_helper: def __init__(self, manifest_file,neuron_version): # All Neuron releases self.releases_info = {} if (manifest_file== None) | (manifest_file== 'default') : self.file = 'neuron-releases-manifest.json' else: self.file = manifest_file ver_tuple = enumerate_release_manifest(nr_setup=self,in_neuron_version=neuron_version) self.neuron_version = ver_tuple[0] self.latest_neuron_version = ver_tuple[1] self.conda_env="" self.python_ver="" self.generic_conda_env="" if self.neuron_version == self.latest_neuron_version: self.is_latest_neuron=True else: self.is_latest_neuron=False if (self.is_latest_neuron) & (neuron_version !=None) & (neuron_version !='latest'): # The user explicitly specified the version, although it is the latest version # in this case the instructions will include the exact versions of the packages self.force_versions=True else: self.force_versions=False def instructions(self,framework,action,framework_version,os,ami,mode): self.framework=framework self.action=action self.mode=mode self.os=os self.ami=ami if (framework_version=='latest'): self.framework_version=None else: self.framework_version=framework_version setup_cmd = "" if (self.framework_version==None): self.fw_package_dict= self.releases_info[self.neuron_version].release_frameworks_main self.fw_comp=self.framework else: self.fw_package_dict= self.releases_info[self.neuron_version].release_frameworks_all self.fw_comp=self.framework_version
setup_cmd=hlpr_instructions(self,self.neuron_version) return setup_cmd if __name__ == '__main__': setup_cmd ='' args = cli_parse_arguments() nr_setup=neuron_setup_helper(manifest_file=args.file,neuron_version=args.neuron_version) cli_validate(update=args.update,neuron_version=nr_setup.neuron_version,framework_version=args.framework_version,is_latest_neuron=nr_setup.is_latest_neuron,ami=args.ami) if (args.list): setup_cmd += cli_list_cmd(nr_setup=nr_setup,neuron_version=nr_setup.neuron_version, list=args.list) else: if (args.install != None)|(args.update !=None): if args.install: framework=args.install action = 'Install' elif args.update: framework=args.update action = 'Update' else: action = None framework=None setup_cmd += nr_setup.instructions(framework=framework,action=action,framework_version=args.framework_version,os=args.os,ami=args.ami,mode=args.mode) print (setup_cmd) ================================================ FILE: src/helperscripts/release-manifest-def.py ================================================ neuron_releases={ "repos":{ "whl":"_url", # url of the wheel repo "rpm":"_url", # url of the rpm repo (yum) "deb":"_url", # url of the debian repo (apt) }, "manifest_date": "_date", "manifest_version":"_ver", # Will increment when the format changes "latest_release":{ "_instance":{ # can be "inf1", "trn1", etc.. "version":"_ver" # latest neuron release that supports the _instance } }, "neuron_versions":{ # all neuron release versions supported by this manifest "_neuron_version":{ # Neuron release version entry e.g. "1.14.0" "python_ver": ["_ver"], # list of python versions supported by this neuron release, e.g. "3.6" "instance_support": ["_instance"], # list of instances supported by this neuron release "arch":["_arch"], # list of architectures supported by this neuron release (e.g. x86) "components":{ # all components included in this neuron release # (e.g. compiler, driver, pytorch ...) "_component_name":{ # component entry (e.g. driver, compiler) "framework":_boolean, # is this component a framework? # needed since there are differences in versioning and content etc .. "packages":{ # all packages of this component that are included in this release # e.g. mxnet supports mx_neuron and mxnet-neuron "_package_name":{ # package entry (e.g. mx_neuron) "install_on_compute_instance":_boolean, # can this package be installed on a compute instance? "versions":{ # all versions of the specific package # e.g. torch-neuron may include multiple versions "_ver":{ # package version entry (e.g. 1.4.1.0) "pre_install_cmds":["_cmd"], # a list of commands to call before installing # the package, e.g. when a plugin needs to install the # framework first, as in mx_neuron "post_install_cmds":["_cmd"], # a list of commands to call after installing the package "format":["_format"], # package format (e.g. bin or src) "content":["_content"], # package content # (e.g. tools include neuron-top, neuron monitor etc .. ) "package_type":["_type"] # list of package types supported (e.g. whl, rpm, deb) } } } } } } } }, "softwarelifecycle":{ # Status of neuron software releases (supported, maintained, deprecated) # Releases that are not under "maintained" or "deprecated" should be considered "supported" "maintained":{ # Releases that are being maintained, no active development, bug fixes can be provided # releases can be a Neuron release, a component (e.g. runtime), or a framework (e.g.
pytorch-1.5.x) "neuron_versions":{ # Neuron versions that are under maintenance status "from":"_ver", # from neuron release version "to":"_ver" # to neuron release version }, "components":{ # Components that are under maintenance status "_component_name":{ # packages in that component "_package_name":{ # package entry "from":"_ver", # from version "to":"_ver" # to version } } }, "frameworks":{ # Frameworks that are under maintenance status "pytorch":{ # PyTorch versions that are under maintenance status "from":"_ver", # from version "to":"_ver" # to version }, "tensorflow":{ # TensorFlow versions that are under maintenance status "from":"_ver", # from version "to":"_ver" # to version }, "mxnet":{ # MXNet versions that are under maintenance status "from":"_ver", # from version "to":"_ver" # to version } } }, "deprecated":{ # Releases that are deprecated, no bug fixes # format similar to "maintained" section }, }, "compatability": { # compatibility section "_component_name": { # component entry "_package_name": { # package entry "_ver_to__ver": { # compatibility entry "from": "_ver", # from version "to": "_ver", # to version "instance_support": [ # instance compatibility "_instance" ], "arch": [ # arch compatibility "_arch" ], "components": { # components compatibility section "_component_name": { # component entry "_package_name": { # package entry "from": "_ver", # from version "to": "_ver" # to version } } } } } } } } ================================================ FILE: src/k8/bert_service.yml ================================================ --- kind: Service apiVersion: v1 metadata: name: inf-k8s-test labels: app: inf-k8s-test spec: ports: - name: http-tf-serving port: 8500 targetPort: 8500 - name: grpc-tf-serving port: 9000 targetPort: 9000 selector: app: inf-k8s-test role: master type: ClusterIP --- kind: Deployment apiVersion: apps/v1 metadata: name: inf-k8s-test labels: app: inf-k8s-test role: master spec: replicas: 1 # Number of desired replicas. Increase to desired number. selector: matchLabels: app: inf-k8s-test role: master template: metadata: labels: app: inf-k8s-test role: master spec: volumes: - name: sock emptyDir: {} containers: - name: inf-k8s-test image: tf-serving-ctr imagePullPolicy: IfNotPresent command: ["/bin/sh","-c"] # Pull model from s3, then start tensorflow_model_server_neuron with the model. args: - "aws s3 sync s3:///bert /tmp/bert && \ tensorflow_model_server_neuron --port=9000 --rest_api_port=8500 --model_name=bert_mrpc_hc_gelus_b4_l24_0926_02 --model_base_path=/tmp//bert/" # Open grpc and rest API ports ports: - containerPort: 8500 - containerPort: 9000 # Arbitrary resource requirements resources: limits: cpu: 4 memory: 4Gi aws.amazon.com/neuron: 1 # desired number of Inferentia devices. requests: cpu: "1" memory: 1Gi aws.amazon.com/neuron: 1 # desired number of Inferentia devices.
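The Deployment above serves the BERT model through the standard TensorFlow Serving interfaces: REST on port 8500 and gRPC on port 9000. As a quick smoke test, the REST predict endpoint can be exercised once the ClusterIP Service is reachable, for example via kubectl port-forward svc/inf-k8s-test 8500:8500. The sketch below assumes that port-forward is running; the request payload is a hypothetical placeholder, since the real feature names and shapes (token ids, masks, and so on) depend on how the BERT SavedModel was exported:

# Minimal REST smoke test for the bert_service.yml Deployment above.
# Assumes: kubectl port-forward svc/inf-k8s-test 8500:8500 is running locally.
import json
import urllib.request

MODEL_NAME = "bert_mrpc_hc_gelus_b4_l24_0926_02"  # matches --model_name in the container args
URL = "http://localhost:8500/v1/models/" + MODEL_NAME + ":predict"

# Placeholder payload: the feature name "input_ids" and the length 128 are
# illustrative assumptions; they must match the SavedModel's serving signature.
payload = {"instances": [{"input_ids": [0] * 128}]}

req = urllib.request.Request(URL, data=json.dumps(payload).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))

The gRPC port (9000) exposed by the same Service serves the equivalent prediction API for clients that need it; the REST path is simply the lightest-weight check.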
================================================ FILE: src/k8/k8s-neuron-device-plugin-rbac.yml ================================================ # rbac.yaml --- kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: neuron-device-plugin rules: - apiGroups: - "" resources: - nodes verbs: - get - list - watch - apiGroups: - "" resources: - events verbs: - create - patch - apiGroups: - "" resources: - pods verbs: - update - patch - get - list - watch - apiGroups: - "" resources: - nodes/status verbs: - patch - update --- apiVersion: v1 kind: ServiceAccount metadata: name: neuron-device-plugin namespace: kube-system --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: neuron-device-plugin namespace: kube-system roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: neuron-device-plugin subjects: - kind: ServiceAccount name: neuron-device-plugin namespace: kube-system ================================================ FILE: src/k8/k8s-neuron-device-plugin.yml ================================================ # https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/ apiVersion: apps/v1 kind: DaemonSet metadata: name: neuron-device-plugin-daemonset namespace: kube-system spec: selector: matchLabels: name: neuron-device-plugin-ds updateStrategy: type: RollingUpdate template: metadata: # Uncomment the annotation below if k8s version is 1.13 or lower # annotations: # scheduler.alpha.kubernetes.io/critical-pod: "" labels: name: neuron-device-plugin-ds spec: serviceAccount: neuron-device-plugin tolerations: - key: CriticalAddonsOnly operator: Exists - key: aws.amazon.com/neuron operator: Exists effect: NoSchedule # Mark this pod as a critical add-on; when enabled, the critical add-on # scheduler reserves resources for critical add-on pods so that they can # be rescheduled after a failure. 
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/ priorityClassName: "system-node-critical" affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: # Uncomment following matchExpressions if using k8s 1.16 or lower #- matchExpressions: # - key: "beta.kubernetes.io/instance-type" # operator: In # values: # - inf1.xlarge # - inf1.2xlarge # - inf1.6xlarge # - inf1.24xlarge # - inf2.xlarge # - inf2.8xlarge # - inf2.24xlarge # - inf2.48xlarge # - trn1.2xlarge # - trn1.32xlarge # - trn1n.32xlarge - matchExpressions: - key: "node.kubernetes.io/instance-type" operator: In values: - inf1.xlarge - inf1.2xlarge - inf1.6xlarge - inf1.24xlarge - inf2.xlarge - inf2.8xlarge - inf2.24xlarge - inf2.48xlarge - trn1.2xlarge - trn1.32xlarge - trn1n.32xlarge containers: # Find all neuron-device-plugin images at https://gallery.ecr.aws/neuron/neuron-device-plugin - image: public.ecr.aws/neuron/neuron-device-plugin:2.22.4.0 imagePullPolicy: Always name: neuron-device-plugin env: - name: KUBECONFIG value: /etc/kubernetes/kubelet.conf - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName securityContext: allowPrivilegeEscalation: false capabilities: drop: ["ALL"] volumeMounts: - name: device-plugin mountPath: /var/lib/kubelet/device-plugins - name: infa-map mountPath: /run volumes: - name: device-plugin hostPath: path: /var/lib/kubelet/device-plugins - name: infa-map hostPath: path: /run ================================================ FILE: src/k8/k8s-neuron-monitor-daemonset.yml ================================================ apiVersion: apps/v1 kind: DaemonSet metadata: name: neuron-monitor namespace: neuron-monitor labels: app: neuron-monitor version: v1 spec: selector: matchLabels: app: neuron-monitor template: metadata: labels: app: neuron-monitor version: v1 spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/os operator: In values: - linux - key: node.kubernetes.io/instance-type operator: In values: - trn1.2xlarge - trn1.32xlarge - trn1n.32xlarge - inf1.xlarge - inf1.2xlarge - inf1.6xlarge - inf2.xlarge - inf2.8xlarge - inf2.24xlarge - inf2.48xlarge containers: - name: neuron-monitor image: public.ecr.aws/neuron/neuron-monitor:1.3.0 ports: - containerPort: 8000 command: - "/opt/bin/entrypoint.sh" args: - "--port" - "8000" - "--neuron-monitor-config" - "/opt/aws/neuron/bin/neuron-monitor.conf" resources: limits: cpu: 500m memory: 256Mi requests: cpu: 256m memory: 128Mi env: - name: GOMEMLIMIT value: 160MiB securityContext: privileged: true ================================================ FILE: src/k8/k8s-neuron-scheduler-configmap.yml ================================================ apiVersion: v1 data: policy.cfg: | { "kind": "Policy", "apiVersion": "v1", "extenders": [ { "urlPrefix": "http://127.0.0.1:32700", "filterVerb": "filter", "bindVerb": "bind", "enableHttps": false, "nodeCacheCapable": true, "managedResources": [ { "name": "aws.amazon.com/neuron", "ignoredByScheduler": false }, { "name": "aws.amazon.com/neurondevice", "ignoredByScheduler": false }, { "name": "aws.amazon.com/neuroncore", "ignoredByScheduler": false } ], "ignorable": false } ] } kind: ConfigMap metadata: name: scheduler-policy namespace: kube-system ================================================ FILE: src/k8/k8s-neuron-scheduler-eks.yml ================================================ # rbac.yaml --- kind: ClusterRole apiVersion: 
rbac.authorization.k8s.io/v1 metadata: name: k8s-neuron-scheduler rules: - apiGroups: - "" resources: - nodes verbs: - get - list - watch - apiGroups: - "" resources: - nodes/status verbs: - update - patch - get - list - watch - apiGroups: - "" resources: - events verbs: - create - patch - apiGroups: - "" resources: - pods verbs: - update - patch - get - list - watch - apiGroups: - "" resources: - bindings - pods/binding verbs: - create --- apiVersion: v1 kind: ServiceAccount metadata: name: k8s-neuron-scheduler namespace: kube-system --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: k8s-neuron-scheduler namespace: kube-system roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: k8s-neuron-scheduler subjects: - kind: ServiceAccount name: k8s-neuron-scheduler namespace: kube-system # deployment yaml --- kind: Deployment apiVersion: apps/v1 metadata: name: k8s-neuron-scheduler namespace: kube-system spec: replicas: 1 strategy: type: Recreate selector: matchLabels: app: neuron-scheduler component: k8s-neuron-scheduler template: metadata: labels: app: neuron-scheduler component: k8s-neuron-scheduler annotations: scheduler.alpha.kubernetes.io/critical-pod: '' spec: serviceAccount: k8s-neuron-scheduler schedulerName: my-scheduler containers: - name: neuron-scheduler-exp # Find all neuron-scheduler images at https://gallery.ecr.aws/neuron/neuron-scheduler image: public.ecr.aws/neuron/neuron-scheduler:2.22.4.0 env: - name: PORT value: "12345" # service.yaml --- apiVersion: v1 kind: Service metadata: name: k8s-neuron-scheduler namespace: kube-system labels: app: neuron-scheduler component: k8s-neuron-scheduler spec: ports: - port: 12345 name: http targetPort: 12345 selector: # select app=ingress-nginx pods app: neuron-scheduler component: k8s-neuron-scheduler ================================================ FILE: src/k8/k8s-neuron-scheduler.yml ================================================ # rbac.yaml --- kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: k8s-neuron-scheduler rules: - apiGroups: - "" resources: - nodes verbs: - get - list - watch - apiGroups: - "" resources: - events verbs: - create - patch - apiGroups: - "" resources: - pods verbs: - update - patch - get - list - watch - apiGroups: - "" resources: - bindings - pods/binding verbs: - create --- apiVersion: v1 kind: ServiceAccount metadata: name: k8s-neuron-scheduler namespace: kube-system --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: k8s-neuron-scheduler namespace: kube-system roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: k8s-neuron-scheduler subjects: - kind: ServiceAccount name: k8s-neuron-scheduler namespace: kube-system # deployment yaml --- kind: Deployment apiVersion: apps/v1 metadata: name: k8s-neuron-scheduler namespace: kube-system spec: replicas: 1 strategy: type: Recreate selector: matchLabels: app: neuron-scheduler component: k8s-neuron-scheduler template: metadata: labels: app: neuron-scheduler component: k8s-neuron-scheduler annotations: scheduler.alpha.kubernetes.io/critical-pod: '' spec: hostNetwork: true tolerations: - effect: NoSchedule operator: Exists key: node-role.kubernetes.io/master - effect: NoSchedule operator: Exists key: node.cloudprovider.kubernetes.io/uninitialized nodeSelector: node-role.kubernetes.io/master: "" serviceAccount: k8s-neuron-scheduler containers: - name: neuron-scheduler # Find all neuron-scheduler images at 
https://gallery.ecr.aws/neuron/neuron-scheduler image: public.ecr.aws/neuron/neuron-scheduler:2.22.4.0 env: - name: PORT value: "12345" # service.yaml --- apiVersion: v1 kind: Service metadata: name: k8s-neuron-scheduler namespace: kube-system labels: app: neuron-scheduler component: k8s-neuron-scheduler spec: type: NodePort ports: - port: 12345 name: http targetPort: 12345 nodePort: 32700 selector: # select app=ingress-nginx pods app: neuron-scheduler component: k8s-neuron-scheduler ================================================ FILE: src/k8/k8s-ultraserver-init-script.sh ================================================ #!/bin/bash MPI_HOST_FILE=/etc/mpi/hostfile NEURON_ULTRASERVER_MODE_UNSET=0 NEURON_ULTRASERVER_MODE_X4=1 NEURON_ULTRASERVER_MODE_X2H=2 NEURON_ULTRASERVER_MODE_X2V=3 NEURON_ULTRASERVER_MODE_X1=4 ULTRASERVER_INIT_DIR=/root/ultraserver_init SORTED_NODES_FILE=$ULTRASERVER_INIT_DIR/sorted_nodes.txt FQDN_MODE_FILE=$ULTRASERVER_INIT_DIR/fqdn_mode.txt ENV_VARS_FILE=$ULTRASERVER_INIT_DIR/us_env_vars.txt NEW_HOST_FILE=$ULTRASERVER_INIT_DIR/new_hostfile export NEURON_ULTRASERVER_SERVER_ID_DEFAULT_VALUE="0000000000000000" export NEURON_ULTRASERVER_NODE_ID_DEFAULT_VALUE=-1 export NEURON_GLOBAL_TOPOID0_HOST="" export NUM_WORKERS=0 cat /dev/null > $SORTED_NODES_FILE cat /dev/null > $FQDN_MODE_FILE cat /dev/null > $ENV_VARS_FILE cat /dev/null > $NEW_HOST_FILE save_sorted_node_list() { # Gather ultraserver information from each worker node mpirun --allow-run-as-root \ --mca orte_keep_fqdn_hostnames 1 \ -host $ip_list \ -x NEURON_ULTRASERVER_SERVER_ID_DEFAULT_VALUE \ -x NEURON_ULTRASERVER_NODE_ID_DEFAULT_VALUE \ -x NEURON_ULTRASERVER_NODE_CONFIG \ sh -c ' if [ -f "/sys/class/neuron_device/server_id_${NEURON_ULTRASERVER_NODE_CONFIG}" ]; then NEURON_ULTRASERVER_SERVER_ID=$(cat /sys/class/neuron_device/server_id_${NEURON_ULTRASERVER_NODE_CONFIG}) else NEURON_ULTRASERVER_SERVER_ID=$NEURON_ULTRASERVER_SERVER_ID_DEFAULT_VALUE fi if [ -f "/sys/class/neuron_device/node_id_${NEURON_ULTRASERVER_NODE_CONFIG}" ]; then NEURON_ULTRASERVER_NODE_ID=$(cat /sys/class/neuron_device/node_id_${NEURON_ULTRASERVER_NODE_CONFIG}) else NEURON_ULTRASERVER_NODE_ID=$NEURON_ULTRASERVER_NODE_ID_DEFAULT_VALUE fi FQDN=$(hostname --fqdn) echo $NEURON_ULTRASERVER_SERVER_ID:$NEURON_ULTRASERVER_NODE_ID:$FQDN ' | sort -t':' -k1,1 -k2,2 -k3,3 > $SORTED_NODES_FILE # Set the topology ids for each worker node local i=0 while IFS= read -r line; do echo "${i}:${line}" ((i++)) done < $SORTED_NODES_FILE > temp && mv temp $SORTED_NODES_FILE NEURON_GLOBAL_TOPOID0_HOST=$(head -n1 $SORTED_NODES_FILE | cut -d: -f4) } validate_node_config() { while read -r server_id; do # Server id and node id are only valid for node configs > 1 if [ $NEURON_ULTRASERVER_NODE_CONFIG -ne 1 ]; then # Validate server id exists if [ "$server_id" = "$NEURON_ULTRASERVER_SERVER_ID_DEFAULT_VALUE" ]; then echo "$NEURON_ULTRASERVER_NODE_CONFIG-node config is not supported" exit 1 fi # Validate there is the correct amount of nodes that share the same server id count=$(grep "$server_id" "$SORTED_NODES_FILE" | wc -l) if [ $count -ne $NEURON_ULTRASERVER_NODE_CONFIG ]; then echo "Error: Incorrect number of nodes with server id $server_id, need $NEURON_ULTRASERVER_NODE_CONFIG nodes but saw $count" exit 1 fi # Validate all the node ids are unique node_ids_count=$(grep "$server_id" "$SORTED_NODES_FILE" | cut -d':' -f3 | sort | uniq | wc -l) if [ $node_ids_count -ne $NEURON_ULTRASERVER_NODE_CONFIG ]; then echo "Error: Found $node_ids_count unique node IDs, expected 
$NEURON_ULTRASERVER_NODE_CONFIG" exit 1 fi fi while IFS=':' read -r tid sid nid fqdn; do # Validate mode is valid for each node modes="${fqdn_modes_map[$fqdn]}" if [ $NEURON_ULTRASERVER_NODE_CONFIG -eq 4 ]; then if echo "$modes" | grep -q "\b$NEURON_ULTRASERVER_MODE_X4\b"; then mode=$NEURON_ULTRASERVER_MODE_X4 else echo "Error: Node $fqdn does not support 4-node config" exit 1 fi elif [ $NEURON_ULTRASERVER_NODE_CONFIG -eq 2 ]; then if echo "$modes" | grep -q "\b$NEURON_ULTRASERVER_MODE_X2V\b"; then mode=$NEURON_ULTRASERVER_MODE_X2V elif echo "$modes" | grep -q "\b$NEURON_ULTRASERVER_MODE_X2H\b"; then mode=$NEURON_ULTRASERVER_MODE_X2H else echo "Error: Node $fqdn does not support 2-node config" exit 1 fi else mode=$NEURON_ULTRASERVER_MODE_X1 fi # Save each worker node's environments variables to a file echo "${tid}:${mode}:${sid}:${nid}:${fqdn}" >> "$ENV_VARS_FILE" done < <(grep "$server_id" "$SORTED_NODES_FILE") done < <(cut -d':' -f2 "$SORTED_NODES_FILE" | sort | uniq) } reorder_hostfile() { # Check if files exist if [ ! -f "$MPI_HOST_FILE" ] || [ ! -f "$SORTED_NODES_FILE" ]; then echo "Error: One or both input files do not exist" exit 1 fi # Extract FQDNs from SORTED_NODES_FILE and reorder entries while IFS=: read -r _ _ _ fqdn; do # Remove .cluster.local suffix clean_fqdn=${fqdn%.cluster.local} # Find the matching line in original file while read -r line; do if [[ "$line" == "$clean_fqdn"* ]]; then echo "$line" >> "$NEW_HOST_FILE" break fi done < "$MPI_HOST_FILE" done < "$SORTED_NODES_FILE" } # Validate node config if [ -z "${NEURON_ULTRASERVER_NODE_CONFIG}" ]; then NEURON_ULTRASERVER_NODE_CONFIG=4 fi if [ $NEURON_ULTRASERVER_NODE_CONFIG -ne 1 ] && [ $NEURON_ULTRASERVER_NODE_CONFIG -ne 2 ] && [ $NEURON_ULTRASERVER_NODE_CONFIG -ne 4 ]; then echo "Error: Invalid ultraserver node config: $NEURON_ULTRASERVER_NODE_CONFIG. Must be 1, 2, or 4." exit 1 fi echo "Using $NEURON_ULTRASERVER_NODE_CONFIG-node config" echo -e "\nCurrent hostfile:" cat $MPI_HOST_FILE # Read the file, extract the first column, resolve IPs, and build the comma-separated string ip_list="" while read line; do ip=$(getent hosts "$line" | awk '{print $1}') if [ -z "$ip" ]; then echo "error: Unable to resolve IP address for host: $line" exit 1 fi if [ -z "$ip_list" ]; then ip_list="$ip" else ip_list="${ip_list},${ip}" fi done < <(cut -d' ' -f1 $MPI_HOST_FILE) echo "Worker pod IPs:" "$ip_list" # Count unique IPs from ip_list and store in NUM_WORKERS NUM_WORKERS=$(echo "$ip_list" | tr -cd ',' | wc -c) NUM_WORKERS=$((NUM_WORKERS + 1)) echo "Number of worker nodes: $NUM_WORKERS" # Validate that the number of workers is a multiple of the node config if [ $((NUM_WORKERS % NEURON_ULTRASERVER_NODE_CONFIG)) -ne 0 ]; then echo "Error: Invalid number of worker nodes for $NEURON_ULTRASERVER_NODE_CONFIG-node config: $NUM_WORKERS." 
exit 1 fi # Create a map of workers to their possible ultraserver modes mpirun --allow-run-as-root \ --mca orte_keep_fqdn_hostnames 1 \ -host $ip_list \ sh -c ' FQDN=$(hostname --fqdn) NEURON_ULTRASERVER_MODE=$(cat /sys/class/neuron_device/ultraserver_mode) echo $FQDN:$NEURON_ULTRASERVER_MODE ' | sort -t':' -k1 > $FQDN_MODE_FILE declare -A fqdn_modes_map while IFS=':' read -r fqdn mode; do fqdn_modes_map["$fqdn"]="$mode" done < $FQDN_MODE_FILE (echo "FQDN:Modes" && cat $FQDN_MODE_FILE) | tr ':' ' ' # Validate worker nodes echo -e "\nSorted nodes:" save_sorted_node_list (echo "TOPO_ID:SERVER_ID:NODE_ID:FQDN" && cat $SORTED_NODES_FILE) | tr ':' ' ' echo -e "\nNEURON_GLOBAL_TOPOID0 node will be: $NEURON_GLOBAL_TOPOID0_HOST" validate_node_config # Update hostlist echo -e "\nUpdated hostfile:" reorder_hostfile cat $NEW_HOST_FILE # Write environment variables to each worker node for line in `cat $ENV_VARS_FILE`; do IFS=':' read -r topo_id mode server_id node_id fqdn <<< "$line" export mode server_id node_id fqdn topo_id mpirun --allow-run-as-root \ --mca orte_keep_fqdn_hostnames 1 \ -host $fqdn \ -x topo_id \ -x NEURON_GLOBAL_TOPOID0_HOST \ -x mode \ -x server_id \ -x node_id \ sh -c ' sed -i "/^NEURON_GLOBAL_TOPOID=/d" /etc/environment sed -i "/^NEURON_GLOBAL_TOPOID0_HOST=/d" /etc/environment sed -i "/^NEURON_RT_ULTRASERVER_MODE=/d" /etc/environment sed -i "/^NEURON_RT_ULTRASERVER_SERVER_ID=/d" /etc/environment sed -i "/^NEURON_RT_ULTRASERVER_NODE_ID=/d" /etc/environment echo "NEURON_GLOBAL_TOPOID=$topo_id" >> /etc/environment echo "NEURON_GLOBAL_TOPOID0_HOST=$NEURON_GLOBAL_TOPOID0_HOST" >> /etc/environment echo "NEURON_RT_ULTRASERVER_MODE=$mode" >> /etc/environment echo "NEURON_RT_ULTRASERVER_SERVER_ID=$server_id" >> /etc/environment echo "NEURON_RT_ULTRASERVER_NODE_ID=$node_id" >> /etc/environment echo "Node $(hostname --fqdn): Variables set and persisted" echo "NEURON_GLOBAL_TOPOID=$topo_id" echo "NEURON_GLOBAL_TOPOID0_HOST=$NEURON_GLOBAL_TOPOID0_HOST" echo "NEURON_RT_ULTRASERVER_MODE=$mode" echo "NEURON_RT_ULTRASERVER_SERVER_ID=$server_id" echo "NEURON_RT_ULTRASERVER_NODE_ID=$node_id" ' done ================================================ FILE: src/k8/my-scheduler.yml ================================================ apiVersion: v1 kind: ServiceAccount metadata: name: my-scheduler namespace: kube-system --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: my-scheduler-as-kube-scheduler subjects: - kind: ServiceAccount name: my-scheduler namespace: kube-system roleRef: kind: ClusterRole name: system:kube-scheduler apiGroup: rbac.authorization.k8s.io --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: my-scheduler-as-volume-scheduler subjects: - kind: ServiceAccount name: my-scheduler namespace: kube-system roleRef: kind: ClusterRole name: system:volume-scheduler apiGroup: rbac.authorization.k8s.io --- kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: my-scheduler rules: - apiGroups: - "" resources: - configmaps verbs: - get - list - watch - apiGroups: - coordination.k8s.io resources: - leases verbs: - create - get - list - update --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: my-scheduler namespace: kube-system roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: my-scheduler subjects: - kind: ServiceAccount name: my-scheduler namespace: kube-system --- apiVersion: v1 kind: ConfigMap metadata: name: my-scheduler-config namespace: kube-system data: 
my-scheduler-config.yaml: | apiVersion: kubescheduler.config.k8s.io/v1 kind: KubeSchedulerConfiguration profiles: - schedulerName: my-scheduler extenders: - urlPrefix: 'http://k8s-neuron-scheduler.kube-system.svc.cluster.local:12345' filterVerb: filter bindVerb: bind enableHTTPS: false nodeCacheCapable: true managedResources: - name: 'aws.amazon.com/neuron' ignoredByScheduler: false - name: 'aws.amazon.com/neuroncore' ignoredByScheduler: false - name: 'aws.amazon.com/neurondevice' ignoredByScheduler: false ignorable: false leaderElection: leaderElect: true resourceNamespace: kube-system resourceName: my-scheduler --- apiVersion: apps/v1 kind: Deployment metadata: labels: component: scheduler tier: control-plane name: my-scheduler namespace: kube-system spec: selector: matchLabels: component: scheduler tier: control-plane replicas: 1 template: metadata: labels: component: scheduler tier: control-plane version: second spec: serviceAccountName: my-scheduler containers: - args: - --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml - --leader-elect=true - --v=2 command: - /usr/local/bin/kube-scheduler image: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.28.5-eks-1-28-latest # or use below for your version of k8s # image: registry.k8s.io/kube-scheduler: livenessProbe: httpGet: path: /healthz port: 10259 scheme: HTTPS initialDelaySeconds: 15 name: kube-second-scheduler readinessProbe: httpGet: path: /healthz port: 10259 scheme: HTTPS resources: requests: cpu: '0.1' securityContext: privileged: false volumeMounts: - name: config-volume mountPath: /etc/kubernetes/my-scheduler hostNetwork: false hostPID: false volumes: - name: config-volume configMap: name: my-scheduler-config ================================================ FILE: src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml ================================================ apiVersion: v1 data: kernel-monitor.json: | { "plugin": "kmsg", "logPath": "/dev/kmsg", "lookback": "5m", "bufferSize": 10, "source": "kernel-monitor", "conditions": [ { "type": "NeuronHealth", "reason": "NeuronHasNoError", "message": "Neuron has no error" } ], "rules": [ { "type": "permanent", "condition": "NeuronHealth", "reason": "NeuronHasError_SRAM_UNCORRECTABLE_ERROR", "pattern": ".* NEURON_HW_ERR=SRAM_UNCORRECTABLE_ERROR .*" }, { "type": "permanent", "condition": "NeuronHealth", "reason": "NeuronHasError_NC_UNCORRECTABLE_ERROR", "pattern": ".* NEURON_HW_ERR=NC_UNCORRECTABLE_ERROR .*" }, { "type": "permanent", "condition": "NeuronHealth", "reason": "NeuronHasError_HBM_UNCORRECTABLE_ERROR", "pattern": ".* NEURON_HW_ERR=HBM_UNCORRECTABLE_ERROR .*" }, { "type": "permanent", "condition": "NeuronHealth", "reason": "NeuronHasError_DMA_ERROR", "pattern": ".* NEURON_HW_ERR=DMA_ERROR .*" } ] } kind: ConfigMap metadata: name: node-problem-detector-config namespace: neuron-healthcheck-system ================================================ FILE: src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml ================================================ apiVersion: v1 kind: ServiceAccount metadata: name: node-problem-detector namespace: neuron-healthcheck-system --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: npd-binding roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: system:node-problem-detector subjects: - kind: ServiceAccount name: node-problem-detector namespace: neuron-healthcheck-system --- apiVersion: rbac.authorization.k8s.io/v1 kind: 
ClusterRole metadata: labels: kubernetes.io/bootstrapping: rbac-defaults name: system:node-problem-detector rules: - apiGroups: - "" resources: - nodes verbs: - get - apiGroups: - "" resources: - nodes/status verbs: - patch - apiGroups: - "" - events.k8s.io resources: - events verbs: - create - patch - update ================================================ FILE: src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml ================================================ apiVersion: apps/v1 kind: DaemonSet metadata: name: node-problem-detector namespace: neuron-healthcheck-system labels: app: node-problem-detector spec: selector: matchLabels: app: node-problem-detector template: metadata: labels: app: node-problem-detector spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: "node.kubernetes.io/instance-type" operator: In values: - inf1.xlarge - inf1.2xlarge - inf1.6xlarge - inf1.24xlarge - inf2.xlarge - inf2.8xlarge - inf2.24xlarge - inf2.48xlarge - trn1.2xlarge - trn1.32xlarge - trn1n.32xlarge containers: - name: node-problem-detector command: - /node-problem-detector - --logtostderr - --config.system-log-monitor=/config/kernel-monitor.json image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19 ports: - containerPort: 20257 resources: limits: cpu: 10m memory: 80Mi requests: cpu: 10m memory: 80Mi imagePullPolicy: Always securityContext: privileged: true env: - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName volumeMounts: - name: log mountPath: /var/log readOnly: true - name: kmsg mountPath: /dev/kmsg readOnly: true # Make sure node problem detector is in the same timezone # with the host. - name: localtime mountPath: /etc/localtime readOnly: true - name: config mountPath: /config readOnly: true - name: node-recovery command: - /bin/sh - -c - "sleep 60 && /scripts/check-health.py" image: public.ecr.aws/neuron/neuron-node-recovery:1.3.0 resources: limits: cpu: 10m memory: 150Mi requests: cpu: 10m memory: 150Mi imagePullPolicy: Always env: - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName - name: ENABLE_RECOVERY value: "false" serviceAccountName: node-problem-detector volumes: - name: log # Config `log` to your system log directory hostPath: path: /var/log/ - name: kmsg hostPath: path: /dev/kmsg - name: localtime hostPath: path: /etc/localtime - name: config configMap: name: node-problem-detector-config defaultMode: 0555 items: - key: kernel-monitor.json path: kernel-monitor.json tolerations: - effect: NoSchedule operator: Exists - effect: NoExecute operator: Exists ================================================ FILE: src/libnrt/README.md ================================================ # NeuronX Runtime Header Files ## Overview The NeuronX Runtime Library provides C APIs for initializing the Neuron hardware, loading models and input data, executing iterations on loaded models, and retrieving output data. This library is provided to customers via a shared object (libnrt.so) that is installed through the `aws-neuronx-runtime-lib` package. This directory exposes the header files that customers can use to write custom applications utilizing the NeuronX Runtime Library. ## File Location These header files will be installed to the user's system under `/opt/aws/neuron/include` when installing the `aws-neuronx-runtime-lib` package and the `libnrt.so` library is installed under the `/opt/aws/neuron/lib` directory. 
## Experimental Headers

The following files contain experimental function declarations and are subject to change in future releases.

- nrt_async.h
- nrt_async_sendrecv.h
- nrt_experimental.h

## Documentation

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-api-guide.html

================================================ FILE: src/libnrt/include/ndl/ndl.h ================================================
/*
 * Copyright 2020-2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */
#pragma once

/* The system header names were lost in extraction; stdint.h, stdbool.h,
 * stddef.h and pthread.h are assumed from the types used below. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>

#include "neuron_driver_shared.h"

#ifdef __cplusplus
extern "C" {
#endif

typedef enum NQ_DEV_TYPE {
    NQ_DEV_TYPE_NEURON_CORE = 0,
    NQ_DEV_TYPE_TOPSP,
    NQ_DEV_TYPE_MAX,
} ndl_nq_dev_t;

#define NEURON_MAX_DEVICES MAX_NEURON_DEVICE_COUNT
#define NEURON_DEVICE_PREFIX "/dev/neuron"
#define NEURON_DRIVER_LIBRARY_MAJOR 1
#define NEURON_DRIVER_LIB_MINOR 0
#define MAX_HBM_PER_DEVICE 4
#define DRIVER_VERSION_MAX_SIZE 32

typedef struct ndl_version_info {
    uint16_t driver_major_version;   // Major version of the driver
    uint16_t driver_minor_version;   // Minor version of the driver
    char driver_full_version[DRIVER_VERSION_MAX_SIZE];
    uint16_t library_major_version;  // Major version of the library
    uint16_t library_minor_version;  // Minor version of the library
} ndl_version_info_t;

/** Get version info.
 *
 * @param[out] version - Buffer to store the version information.
 *
 * @return 0 on success.
 *         -1 on failure to read the driver version.
 */
int ndl_get_version(ndl_version_info_t *version);

/** Gets the range of compatible versions.
 *
 * @param min_compatible_version [out] - Lowest supported version
 * @param max_compatible_version [out] - Highest supported version
 *
 * @return 0 on success.
 */
int ndl_get_compatible_version(uint32_t *min_compatible_version, uint32_t *max_compatible_version);

typedef struct ndl_device_init_param {
    bool initialize_device;  // if set to true, device is initialized as part of open()
    int num_dram_regions;    // splits device DRAMs into given number of regions
    bool map_hbm;            // if set to true, HBM will be mapped during device open
} ndl_device_init_param_t;

#define NDL_COPY_BUF_SIZE (2ull * 1024 * 1024)

typedef struct ndl_copy_buf {
    uint64_t mem_handle;
    void *mmap_va;
    pthread_mutex_t lock;
} ndl_copy_buf_t;

// Maximum neuron devices supported on a system.
#define MAX_NEURON_DEVICE_COUNT 64
// Maximum neuron cores per device
#define MAX_NC_PER_DEVICE 8

typedef struct ndl_device {
    uint8_t device_index;  // Device Index
    uint8_t device_type;   // Device Type (V1, V2..)
uint16_t device_revision; // Revision id of board uint8_t connected_device_count; // Number of devices connected to this device uint8_t connected_devices[MAX_NEURON_DEVICE_COUNT]; // Array of devices(IDs) connected to this device uint64_t csr_base[2]; // BAR0/BAR2 base uint64_t csr_size[2]; // BAR0/BAR2 size ndl_copy_buf_t cpy_bufs[MAX_NC_PER_DEVICE]; // MMAP buffers for efficiently copying data in/out of the device void *hbm_va[MAX_HBM_PER_DEVICE]; // HBM virtual addresses size_t hbm_size; // HBM sizes uint32_t hbm_va_cnt; // Number of active HBM regions uint32_t shift_hbm_size; // Cached number of bits to shift uint64_t hbm_offset[MAX_HBM_PER_DEVICE]; // HBM offsets uint8_t context[]; // Library reserved fields } ndl_device_t; typedef struct ndl_device_nc { ndl_device_t *device; uint32_t nc_id; } ndl_device_nc_t; typedef struct ndl_device_context { int nd_fd; } ndl_device_context_t; typedef struct ndl_mem_info { ndl_device_t *device; __u64 driver_handle; uint64_t pa; uint64_t mmap_offset; uint64_t size; uint32_t align; void *mmap_va; uint32_t host_memory; int nc_id; } ndl_mem_info_t; typedef struct ndl_notification_context { union { uint8_t nc_id; // neuron core index uint8_t nq_dev_id; // notification device index }; ndl_nq_dev_t nq_dev_type; // notification device type uint8_t nq_type; // type of the notification queue uint8_t engine_index; // engine index uint32_t size; // size of the NQ in bytes int fd; // file descriptor of /dev/ndX/ncY/nqZ uint64_t offset; //mmap offset in the nd uint64_t mem_handle; void *va; // mmapped address ndl_mem_info_t *mem_info; // NQ memory info } ndl_notification_context_t; /** * Called by app the first time when it accesses the device. * * @param[in] device_index - device index that is to be opened * @param[in] num_tdram_regions - number of tdram regions * @param[out] device - device specific information * * @return 0 on success. * -1 on failure */ int ndl_open_device(int device_index, ndl_device_init_param_t *params, ndl_device_t **device); /** * Called by app when it is done. After this, device cannot be accessed * * @param[in] device - Device to close. * * @return 0 on success. * -1 on failure */ int ndl_close_device(ndl_device_t *device); /** * Get all the device index * * @param[out] device_indexes - Buffer to store device indexes. * @param[in] device_indexes_size - Size of the buffer in dwords. * * @return Number of devices found. */ int ndl_available_devices(int *device_indexes, int device_indexes_size); /** Read from one or more registers. * * @param device[in] - Device handle. * @param bar[in] - BAR to read. * @param addresses[in] - Array of register addresses. * @param count[in] - Number of registers in the array. * @param buffer[out] - Buffer to store read data. * * @return 0 on success. */ int ndl_bar_read(ndl_device_t *device, uint8_t bar, uint64_t *addresses, uint32_t count, uint32_t *buffer); /** Write to one or more registers. * * @param device[in] - Device handle. * @param bar[in] - BAR to write. * @param addresses[in] - Array of register addresses. * @param count[in] - Number of registers in the array. * @param data[in] - Data to write. * * @return 0 on success. */ int ndl_bar_write(ndl_device_t *device, uint8_t bar, uint64_t *addresses, uint32_t count, uint32_t *data); /** Read hw counters from one or more addresses * * @param device[in] - Device handle. * @param addresses[in] - Array of register addresses. * @param count[in] - Number of registers in the array. * @param buffer[out] - Buffer to store read data. 
* * @return 0 on success. */ int ndl_read_hw_counters(ndl_device_t *device, uint64_t *addresses, uint32_t count, uint32_t *data); /** * Retrieves the cached HBM virtual address for the specified device. * * @param device[in] - Device handle. * @param hbm_idx[in] - HBM index. * @param va[out] - Resulting virtual address. * @param size[out] - Size of the HBM * * @return 0 on success, -EINVAL on failure, and -ENOENT when there are no more entries to be found. */ int ndl_get_hbm_va(ndl_device_t *device, int hbm_idx, void **va, size_t *size); /** Allocates memory. * * @param device[in] - Device to be associated with the allocation. * @param size[in] - Number of bytes to allocate. * @param host_memory[in] - If true allocate from host memory instead of using device memory. * @param dram_channel[in] - DRAM channel to use in the device memory. * @param dram_region[in] - DRAM region to use in the device memory. * @param nc_id[in] - NC ID to use in the device * @param mem_alloc_type[in]- Type of memory allocation * @param mem_handle[out] - Allocated memory handle would be stored here. * * @return 0 on success. */ int ndl_memory_alloc(ndl_device_t *device, size_t size, uint64_t align, uint32_t host_memory, uint32_t dram_channel, uint32_t dram_region, uint32_t nc_id, uint32_t mem_alloc_type, uint64_t *mem_handle); /** Given a mem handle gets it PA - HACK to be removed * @param mem_handle[in] - Memory handle * @parama pa[out] - Physical address of handle * * @return the PA */ int ndl_memory_get_pa(uint64_t mem_handle, uint64_t *pa); /** Map given m memory handle into virtual address space. * * @param mem_handle[in] - Handle to map. * @param va[out] - Resulting virtual address. * * @return 0 on success */ int ndl_memory_map(uint64_t mem_handle, void **va); /** Unmap given memory handle from virtual address space. * * @param mem_handle[in] - Handle to unmap. * * @return 0 on success */ int ndl_memory_unmap(uint64_t mem_handle); /** Frees already allocated memory. * * @param mem_handle[in] - Memory handle to be freed. * * @return 0 on success. */ int ndl_memory_free(uint64_t mem_handle); /** Copy data from buffer to mem_handle. * * @param mem_handle[in] - Handle on which data needs to be copied in. * @param buffer - Buffer from which data needs to be copied. * @param offset - Offset in the mem handle. * @param size - Size in bytes to be copied. * * @return 0 on success. */ int ndl_memory_buf_copyin(uint64_t mem_handle, void *buffer, uint64_t offset, size_t size); /** Copy data from mem_handle to buffer. * * @param mem_handle[in] - Handle from which data needs to be copied out. * @param buffer - Buffer to which data needs to be copied. * @param offset - Offset in the mem handle. * @param size - Size in bytes to be copied. * * @return 0 on success. */ int ndl_memory_buf_copyout(uint64_t mem_handle, void *buffer, uint64_t offset, size_t size); /** Copy data from buffer to mem_handle (zero copy, buffer is pinned and used directly). * * @param mem_handle[in] - Handle on which data needs to be copied in. * @param buffer - Buffer from which data needs to be copied. * @param offset - Offset in the mem handle. * @param size - Size in bytes to be copied. * * @return 0 on success. */ int ndl_memory_buf_zerocopyin(uint64_t mem_handle, void *buffer, uint64_t offset, size_t size, int qid, uint32_t bar4_wr_threshold); /** Copy data from mem_handle to buffer (zero copy, buffer is pinned and used directly). * * @param mem_handle[in] - Handle from which data needs to be copied out. 
* @param buffer - Buffer to which data needs to be copied. * @param offset - Offset in the mem handle. * @param size - Size in bytes to be copied. * @param qid - H2T queue to use. NEURON_DMA_H2T_DEFAULT_QID is default * * @return 0 on success. */ int ndl_memory_buf_zerocopyout(uint64_t mem_handle, void *buffer, uint64_t offset, size_t size, int qid); /** Batch transfer data between host buffers and device memory. * * @param mem_handle[in] - Device memory handle * @param ops[in] - Array of batch operations * @param num_ops[in] - Number of operations in batch * @param direction[in] - Transfer direction (0=write to device, 1=read from device) * @param qid[in] - H2T queue to use (-1 for default) * * @return 0 on success. */ int ndl_memory_buf_batch_copy(neuron_memcpy_batch_t *batches, uint64_t num_batches, uint32_t direction, int qid); /** Copy data from buffer to addr in engine. * * @param device[in] - Device information. * @param nc_id [in] - Neuron core id. * @param dst [in] - Address on which data needs to be copied in. * @param buffer - Buffer from which data needs to be copied. * @param offset - Offset in the mem handle. * @param size - Size in bytes to be copied. * @param qid - H2T queue to use. NEURON_DMA_H2T_DEFAULT_QID is default * * @return 0 on success. */ int ndl_program_engine(ndl_device_t *device, uint32_t nc_id, uint64_t dst, void *buffer, uint64_t offset, size_t size); /** Memset the given memhandle with passed byte value * * @param src_mem_handle[in]- Handle which needs to be filled with byte value * @param offset - Src Offset in the mem handle. * @param value - Byte value to fill the memory with * @param size - Size in bytes to be copied. * * @return 0 on success. */ int ndl_memset(const uint64_t addr, uint64_t offset, const int value, const size_t size); /** Copy data between mem_handles. * * @param src_mem_handle[in]- Handle from which data needs to be copied out. * @param dst_mem_handle[in]- Handle from which data needs to be copied to. * @param src_offset - Src Offset in the mem handle. * @param dst_offset - Dest Offset in the mem handle. * @param size - Size in bytes to be copied. * * @return 0 on success. */ int ndl_memory_copy(uint64_t src_mem_handle, uint64_t dst_mem_handle, uint64_t src_offset, uint64_t dst_offset, size_t size); /** Copy data between mem_handles asynchronously. * * @param src_mem_handle[in] - Handle from which data needs to be copied out. * @param dst_mem_handle[in] - Handle from which data needs to be copied to. * @param src_offset - Src Offset in the mem handle. * @param dst_offset - Dest Offset in the mem handle. * @param size - Size in bytes to be copied. * @param prefetch_addr - Host destination address associate with copy out operation to prefetch * @param wait_handle [in/out] - wait_handle [in] is for prev request, [out] is handle for this request * * @return 0 on success. */ int ndl_memory_copy_as(uint64_t src_mem_handle, uint64_t dst_mem_handle, uint64_t src_offset, uint64_t dst_offset, size_t size, uint64_t prefetch_addr, int * wait_handle); /** Copy data between mem_handles. * * @param mem_handle[in] - Handle from which data for this tran (either src or dst) * @param wait_handle - wait_handle for an async dma * * @return 0 on success. */ int ndl_memory_copy_as_wait(uint64_t mem_handle, int wait_handle); /** Set the dma engine state * * @param device_index[in] - Device index. * @param eng_id[in] - Eng ID that is initialized. * @param state[in] - State that is set UDMA_NORMAL/UDMA_DISABLE etc * * @return 0 on success. 
*/ int ndl_dma_eng_set_state(int device_index, uint32_t eng_id, uint32_t state); /** Get the dma engine state * * @param device_index[in] - Device index. * @param eng_id[in] - Engine index which status needs to be collected. * @param state[out] - Buffer to store engine state. * * @return 0 on success. */ int ndl_dma_eng_get_state(int device_index, uint32_t eng_id, struct neuron_dma_eng_state *state); /** Get DMA queue state * * @param device_index[in] - Device index. * @param eng_id [in] - DMA engine index. * @param qid [in] - DMA queue index. * @param tx [out] - Tx queue state. * @param rx [out] - Rx queue state. * * @return 0 on success. */ int ndl_dma_queue_get_state(int device_index, uint8_t eng_id, uint8_t qid, struct neuron_dma_queue_state *tx, struct neuron_dma_queue_state *rx); /** Copy DMA descriptors to userspace. * * This API needs root privilege. * * @param device_index[in] - Device index. * @param eng_id [in] - DMA engine index. * @param qid [in] - DMA queue index. * @param type [in] - Type of the queue. * @param index [in] - Start descriptor index. * @param count [in] - Number of descriptor needs to be copied. * @param buffer [out] - Buffer to store the descriptors. * * @return 0 on success. */ int ndl_dma_descriptor_copyout(int device_index, uint8_t eng_id, uint8_t qid, enum neuron_dma_queue_type type, uint32_t start_index, uint32_t count, void *buffer); /** Initialize the dma queue for a given engine * * @param device_index[in] - Device index * @param eng_id[in] - Engine for which the queue is initialized * @param qid[in] - Queue id that needs to be initialized * @param tx_desc_count[in] - number of tx desc's need to be allocated * @param rx_desc_count[in] - number of rx desc's need to be allocated * @param tx_handle[in] - TX mem handle * @param rx_handle[in] - RX mem handle * @param rxc_handle[in] - Completion mem handle * * @return 0 on success. */ int ndl_dma_queue_init(int device_index, uint32_t eng_id, uint32_t qid, uint32_t tx_desc_count, uint32_t rx_desc_count, uint64_t tx_handle, uint64_t rx_handle, uint64_t rxc_handle, uint32_t axi_port); struct ndl_queue_init { __u32 eng_id; // [in] DMA engine index __u32 qid; // [in] Queue index in the DMA engine __u32 tx_desc_count; // [in] number of tx desc's need to be allocated __u32 rx_desc_count; // [in] number of rx desc's need to be allocated __u64 tx_handle; // [in] mem handle for the tx ring __u64 rx_handle; // [in] mem handle for the rx ring __u64 rxc_handle; // [in] mem handle for the rxc ring __u32 axi_port; // [in] axi port }; #define MAX_NDL_QUEUE_INIT_BATCH 256 struct ndl_queue_init_batch { __u32 count; struct ndl_queue_init entries[MAX_NDL_QUEUE_INIT_BATCH]; }; /** Initialize a batch of dma queues * * @param device_index[in] - Device index * @param batch[in] - Batch of dma queue initialization requests * * @return 0 on success. */ int ndl_dma_queue_init_batch(int device_idx, struct ndl_queue_init_batch *batch); /** Release the dma queue for a given engine - only used in tests * * @param device_index[in] - Device index * @param eng_id[in] - Engine for which the queue is initialized * @param qid[in] - Queue id that needs to be initialized * * @return 0 on success. 
 */
int ndl_dma_queue_release(int device_index, uint32_t eng_id, uint32_t qid);

/** Starts DMA by copying the given number of descriptors, or prefetches s2m
 *
 * @param device_index[in] - Device index
 * @param eng_id[in] - Engine for which the queue is initialized
 * @param qid[in] - Queue id that needs to be initialized
 * @param tx_desc_count[in] - number of tx descriptors to copy; may be 0 when called for s2m prefetch
 * @param rx_desc_count[in] - number of rx descriptors to copy
 *
 * @return 0 on success.
 */
int ndl_dma_queue_copy_start(int device_index, uint32_t eng_id, uint32_t qid, uint32_t tx_desc_count,
                             uint32_t rx_desc_count);

/** Acks the completed descriptor count for the eng/queue - only used in tests
 *
 * @param device_index[in] - Device index
 * @param eng_id[in] - Engine for which the queue is initialized
 * @param qid[in] - Queue id that needs to be initialized
 * @param count[in] - Number of descriptors to ack
 *
 * @return 0 on success.
 */
int ndl_dma_ack_completed_desc(int device_index, uint32_t eng_id, uint32_t qid, uint32_t count);

/** Copy data from buffer to mem_handle. The buffer holds dma descriptors.
 *
 * @param mem_handle[in] - Handle to which data needs to be copied.
 * @param buffer[in] - Buffer from which data needs to be copied; holds dma descriptors.
 * @param offset[in] - Offset in the mem handle.
 * @param num_descs[in] - Number of descriptors to copy
 * @param queue_type[in] - Queue from which to copy descriptors.
 *
 * @return 0 on success.
 */
int ndl_dma_copy_descriptors(uint64_t mem_handle, void *buffer, uint64_t offset, uint32_t num_descs,
                             enum neuron_dma_queue_type queue_type);

/** Reset given NCs within a device.
 *
 * @param device_index[in] - Device to reset.
 * @param nc_map[in] - NCs to reset (-1 to reset entire device)
 * @param request_id[out] - ID for this reset request
 *
 * @return 0 on success.
 */
int ndl_reset_ncs(int device_index, int nc_map, uint32_t *request_id);

/** Register a callback with NRT to warn/nudge users when hitting a soft incompatibility
 *
 * @param callback - the callback function
 * @return int - 0 on success, otherwise on failure
 */
int ndl_register_soft_incompat_callback(void (*callback)(const char *));

/** Waits for readiness of given NCs within a device.
 *
 * @param device_index[in] - Device index.
 * @param request_id[in] - ID of the reset request to wait on
 * @param result[out] - Buffer to store the result.
 *                      If the device is ready then this would be set to 1.
 *
 * @return 0 on success.
 */
int ndl_ready_ncs(int device_index, uint32_t request_id, uint8_t *result);

/** Get info on all the apps that are currently using the device; the caller needs to free the returned info (*info)
 *
 * @param device[in] - Device
 * @param info[out] - Pointer to a pointer which will hold app data, needs to be deallocated by caller
 * @param count[out] - Number of entries in neuron_app_info
 *
 * @return 0 - on success
 */
int ndl_get_all_apps_info(ndl_device_t *device, struct neuron_app_info **info, size_t *count,
                          uint16_t apps_info_flags);

/** Increment a semaphore in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param semaphore_index[in] - Semaphore which needs to be incremented.
 * @param value[in] - Value to increment by.
 *
 * @return 0 on success
 */
int ndl_nc_semaphore_increment(ndl_device_t *device, int nc_index, uint32_t semaphore_index, uint32_t value);

/** Decrement a semaphore in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param semaphore_index[in] - Semaphore which needs to be decremented.
 * @param value[in] - Value to decrement by.
 *
 * @return 0 on success
 */
int ndl_nc_semaphore_decrement(ndl_device_t *device, int nc_index, uint32_t semaphore_index, uint32_t value);

/** Get semaphore value in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param semaphore_index[in] - Semaphore index.
 * @param value[out] - Buffer where the read value would be stored.
 *
 * @return 0 on success
 */
int ndl_nc_semaphore_read(ndl_device_t *device, int nc_index, uint32_t semaphore_index, uint32_t *value);

/** Write given value into the semaphore in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param semaphore_index[in] - Semaphore index.
 * @param value[in] - Value to write.
 *
 * @return 0 on success
 */
int ndl_nc_semaphore_write(ndl_device_t *device, int nc_index, uint32_t semaphore_index, uint32_t value);

/** Get event value in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param event_index[in] - Event index.
 * @param value[out] - Buffer where the read value would be stored.
 *
 * @return 0 on success
 */
int ndl_nc_event_get(ndl_device_t *device, int nc_index, uint32_t event_index, uint32_t *value);

/** Set an event in Neuron Core.
 *
 * @param device[in] - Device
 * @param nc_index[in] - Neuron Core index
 * @param event_index[in] - Event index.
 * @param value[in] - Value to write.
 *
 * @return 0 on success
 */
int ndl_nc_event_set(ndl_device_t *device, int nc_index, uint32_t event_index, uint32_t value);

/** Configure notification queue
 *
 * A Neuron device has multiple neuron cores and TOP_SPs. If nq_dev_type is
 * NQ_DEV_TYPE_NEURON_CORE, nq_dev_index conveys the neuron core index. In case of
 * NQ_DEV_TYPE_NEURON_TOPSP, nq_dev_index means the TOP_SP index.
 *
 * @param device[in] - Device
 * @param nq_dev_id[in] - Notification device index
 * @param nq_dev_type[in] - Notification device type
 * @param nq_type[in] - Notification queue type
 * @param engine_index[in] - Engine index
 * @param size[in] - Size in bytes
 * @param on_host_memory[in] - If true, NQ is created on host memory
 * @param dram_channel - If NQ is created on device, DRAM channel to use
 * @param dram_region - If NQ is created on device, DRAM region to use
 * @param context[out] - Resulting NQ context.
 *
 * @return 0 on success.
 */
int ndl_notification_init(ndl_device_t *device, int nq_dev_id, ndl_nq_dev_t nq_dev_type, uint8_t nq_type,
                          uint8_t engine_index, uint32_t size, bool on_host_memory, uint32_t dram_channel,
                          uint32_t dram_region, uint64_t *notification_context);

/** Configure notification queue with the option to force re-allocate/re-size
 *
 * A Neuron device has multiple neuron cores and TOP_SPs. If nq_dev_type is
 * NQ_DEV_TYPE_NEURON_CORE, nq_dev_index conveys the neuron core index. In case of
 * NQ_DEV_TYPE_NEURON_TOPSP, nq_dev_index means the TOP_SP index.
* * @param device[in] - Device * @param nq_dev_id[in] - Notification device index * @param nq_dev_type[in] - Notification device type * @param nq_type[in] - Notification queue type * @param engine_index[in] - Engine index * @param size[in] - Size in bytes * @param on_host_memory[in] - If true, NQ is created on host memory * @param dram_channel - If NQ is created on device, DRAM channel to use * @param dram_region - If NQ is created on device, DRAM region to use * @param force_alloc_mem - If true, force allocate new memory (and delete already allocated memory, if any) * @param context[out] - Resulting NQ context. * * @return 0 on success. */ int ndl_notification_init_with_realloc(ndl_device_t *device, int nq_dev_id, ndl_nq_dev_t nq_dev_type, uint8_t nq_type, uint8_t engine_index, uint32_t size, bool on_host_memory, uint32_t dram_channel, uint32_t dram_region, bool force_alloc_mem, uint64_t *notification_context); /** Returns mem_handle associated with the NQ * * @param notification_context[in] - Notification context * @param mem_handle[out] - Notification's memory handle would be stored here. * * @return 0 on success, 1 on failure */ int ndl_notification_get_mem_handle(uint64_t notification_context, uint64_t *mem_handle); /** Returns size associated with the NQ * * @param notification_context[in] - Notification context * @param size[out] - Notification's size would be stored here. * * @return 0 on success, 1 on failure */ int ndl_notification_get_size(uint64_t notification_context, uint32_t *size); /** Maps NQ to virtual address. * * @param notification_context[in] - Notification context. * @param va [out] - Virtual address where the mapping is done. * @return 0 on success */ int ndl_notification_map(uint64_t notification_context, void **va); /** Stops and destroys already configured notification queue. * * @param notification_context[in] - Notification context. * * @return 0 on success. */ int ndl_notification_destroy(uint64_t notification_context); /** Makes neuron ds available for use and returns a valid pointer in **data and a valid size in *size * * @param device[in] - Device * @param pid[in] - PID for this NDS (if 0 it allocates a new one) * @param data[out] - Will contain a valid pointer to the datastore * @param size[out] - Will contain a valid size for the datastore * * @return 0 on success. */ int ndl_nds_open(ndl_device_t *device, int32_t pid, void **data, size_t *size); /** Decreases ref count for the given pid * * @param device - Device * @param pid - PID owning the datastore * @param data - Pointer to datastore raw data (returned by ndl_nds_open) * @param size - Size of datastore (returned by ndl_nds_open) * * @return 0 on success. */ int ndl_nds_close(ndl_device_t *device, int32_t pid, void *data, size_t size); /** Enter inference critical section. * * @param device[in] - Device * @param nc_index[in] - Neuron core index * @param uuid[in] - UUID of the model expected to be loaded * * This function would fail if the UUID is different or PID * which loaded the UUID is different. * * @return 0 on success, -1 on failure. */ int ndl_crwl_reader_enter(ndl_device_t *device, int nc_index, struct neuron_uuid uuid); /** Exit inference critical section. * * @param device[in] - Device * @param nc_index[in] - Neuron core index * @param uuid[in] - UUID of the model expected to be loaded * * @return 0 on success, -1 on failure. */ int ndl_crwl_reader_exit(ndl_device_t *device, int nc_index, struct neuron_uuid uuid); /** Enter model load critical section. 
* * @param device[in] - Device * @param nc_index[in] - Neuron core index * @param uuid[in] - UUID of the model to be loaded * * @return 0 on success, -1 on failure. */ int ndl_crwl_writer_enter(ndl_device_t *device, int nc_index, struct neuron_uuid uuid); /** Exit model load critical section and enter inference critical section. * * @param device[in] - Device * @param nc_index[in] - Neuron core index * @param uuid[in] - UUID of the loaded model * * @return 0 on success, -1 on failure. */ int ndl_crwl_writer_downgrade(ndl_device_t *device, int nc_index, struct neuron_uuid uuid); /** Find given number of free NCs and mark them as used. * * @param nc_count[in] - Number of free neuron cores needed. * @param start_nc[in] - From where to start the free core search. * @param end_nc[in] - Last NC where to stop the free core search. * @param max_nc_available[out] - Maximum number of free cores available. * @param bitmap[out] - Bitmap of marked neuron core indexes. * @param size[in] - size of the bitmap in bytes * * @return 0 on success, -1 on failure. */ int ndl_crwl_nc_range_mark(uint32_t nc_count, uint32_t start_nc, uint32_t end_nc, uint32_t *max_nc_available, uint64_t *bitmap, size_t size); /** Unmark neuron cores as free. * * @param bitmap[in] - Bitmap of marked neuron core indexes. * @param size[in] - size of the bitmap in bytes * * @return 0 on success, -1 on failure. */ int ndl_crwl_nc_range_unmark(uint64_t *bitmap, size_t size); /** Gets the info for the copy buffer for copying data to/from device * * To dma data in and out of the device, app needs a host dram buffer allocated * by the driver. Allocating this every-time is expensive especially if we want * a bigger copy size. To avoid this performance penalty, applications can use * this preallocated buffer. * * @param device[in] - Device * @param nc_id[in] - nc id the copy buffer is from * @param cpy_buf[out] - Pointer to copy buffer * * @return 0 on success */ int ndl_get_copy_buf(ndl_device_t *device, uint32_t nc_id, ndl_copy_buf_t **cpy_buf); /** Set the neuron core init state * Initially the state is set to started and then app intializes the neuron core. Then * it sets the state to completed. If any other app tries to set the state to started when it * is already started then this routine will block until the init is done or timeout * * @param device[in] - Device * @param state[in] - State that will be state * @param new_state[out] - State after the set is done * * @return 0 on success, -1 on failure. */ int ndl_nc_init_set_state(ndl_device_t *device, uint32_t nc_id, uint32_t state, uint32_t *new_state); /** Gets the state of model start. If this is the first model that will be loaded in the nc. * * @param device[in] - Device * @param nc_id[in] - nc id * @param started_count[out] - number of times model started in that nc * * @return 0 on success, -1 on failure. 
*/ int ndl_nc_model_started_count(ndl_device_t *device, uint32_t nc_id, uint64_t *started_count); /** Gets the architecture & revision of the board * * @param architecture[out] - Architecture of the board * @param revision[out] - Revision of the board * * @return 0 on success */ int ndl_get_board_info(uint32_t *architecture, uint32_t *revision); /** Gets BDF for a device - only for devices opened by the calling process - DEPRECATED don't use * * @param bus_num[out] - Bus number for this device * @param pci_slot[out] - PCI slot for this device * @param dev_func[out] - Device function for this device * * @return 0 on success */ int ndl_get_device_bdf(int device_index, uint32_t *bus_num, uint8_t *pci_slot, uint8_t *dev_func); /** * @brief Get the anonymous file-descriptor of dma-buf associated with * a Neuron device memory region if it was registered for EFA peer direct * * @param addr[in] - Device buffer virtual address * @param size[in] - Device buffer size (in bytes) * @param fd[out] - dma-buf fd * * @return 0 on success */ int ndl_get_dmabuf_fd(uint64_t addr, uint64_t size, int* fd); /** Gets BDF for a device * * @param device_index[in] - Neuron device index * @param domain[out] - PCIe domain for the device * @param bus_num[out] - Bus number for the device * @param pci_slot[out] - PCI slot for the device * @param dev_func[out] - Device function for the device * * @return 0 on success */ int ndl_get_device_bdf_ext(int device_index, uint32_t *domain, uint32_t *bus_num, uint8_t *pci_slot, uint8_t *dev_func); /** retrieve offset/size where to mmap around a physical address * * @param device[in] - Neuron device * @param pa[in] - physical address in device mem to retrieve mc mmap info for * @param mmap_offset[out] - mmap offset * @param mem_handle[out] - The handle for the given physical address. * Set to 0 when using backwards compatible interface with old drivers. * @param size[out] - size * */ int ndl_mem_get_mc_mmap_info(ndl_device_t *device, uint64_t pa, uint64_t *mmap_offset, uint64_t *size, uint64_t *mem_handle); /** mmap a bar region into user address * * @param device[in] - Neuron device * @param block[in] - block type containing the resource * @param block_id[in] - id of the block if is more than one block * @param resource[in] - resource the caller wants to mmap * @param va[out] - virual address of the resource * @param size[out] - size of the resource * */ int ndl_mmap_bar_region( ndl_device_t *device, enum neuron_dm_block_type block, uint32_t block_id, enum neuron_dm_resource_type resource, void ** va, uint64_t * size); /** Close all cached FDs * */ void ndl_device_cached_fd_close_all(void); /** Log an error message to kernel messages/serial console * * @param str[in] - The error message * @param size[in] - The size of the error message including null terminator * @param action[in] - Additional action to perform * * @return On success: 0 * On failure: -1 and: * * errno == EFAULT when size is too large * * errno == EBADMSG when str is not null terminated */ int ndl_printk(char *str, uint32_t size, uint32_t action); /** get the host device id for an open device (for containers) * * @param device[in] - Neuron device * @param host_device_id[out] - host device id * */ int ndl_get_host_device_id(ndl_device_t *device, uint32_t *host_device_id); /** return device id to routing id mapping table along with number of entries in the table * * @param count[in/out] - [in] size of map in entries. 
/** get the host device id for an open device (for containers)
 *
 * @param device[in] - Neuron device
 * @param host_device_id[out] - host device id
 */
int ndl_get_host_device_id(ndl_device_t *device, uint32_t *host_device_id);

/** return the device id to routing id mapping table along with the number of entries in the table
 *
 * @param count[in/out] - [in] size of map in entries, [out] # entries returned
 * @param host_did_to_rid_map[out] - map of host device id to routing ids
 */
int ndl_get_host_device_id_to_rid_map(uint32_t *count, uint32_t *host_did_to_rid_map);

int ndl_dump_device_allocation_info(ndl_device_t *device, uint32_t hbm_index, struct neuron_ioctl_mem_chunk_info *data, uint32_t *num_entries);

/** ask the driver to dump neuron core process info
 *
 * @param nc_id[in] - neuron core to dump process info for
 * @param filter_log_owner[in] - only dump log entries for the owner pid of the neuron core
 * @param log_dump_limit[in] - max number of log entries to dump
 */
int ndl_dump_nc_pid_info(uint32_t nc_id, bool filter_log_owner, uint32_t log_dump_limit);

/** write a value to the entire HBM accessible to Neuron (so excludes the firmware carveout)
 *
 * @param hbm_index - HBM to write to
 * @param init_val - value to write
 */
int ndl_hbm_scrub_start(ndl_device_t *device, uint32_t nc_id, uint32_t hbm_index, uint32_t axi_port, uint32_t init_val);
int ndl_hbm_scrub_wait(ndl_device_t *device, uint32_t nc_id, uint32_t hbm_index);

/** Gets the tpb mapping.
 *
 * @param map[out] - Location to store the mapping information
 * @param max_num_entries[in] - Maximum number of entries we can store in `map`
 * @param mapping_version[in] - Flavor of mapping to get from the driver
 *
 * @return 0 on success
 */
int ndl_get_logical_to_physical_nc_map(struct neuron_ioctl_nc_map *map, uint32_t max_num_entries, enum neuron_ioctl_nc_mapping_type mapping_version);

/** return pod information
 *
 * @param pod_type[out] - type of pod
 * @param pod_sz[out] - size of the pod
 */
int ndl_pod_info(uint32_t * pod_type, uint32_t * pod_sz);

/** return pod election state
 *
 * @param state[out] - election state
 */
int ndl_pod_election_state(uint32_t * state);

/** return pod mapping information.
 *
 * @param node_id[out] - node id of the pod node. -1 if the node is not part of a configured pod
 */
int ndl_pod_mapping_info(int * node_id);

/** return pod status
 *
 * @param pod_id[out] - pod id. Only valid if the node is configured as a pod
 * @param state[out] - state of the pod election
 * @param pod_type[out] - type of pod
 * @param pod_sz[out] - size of the pod. 0 if the node is not part of a pod
 * @param node_id[out] - node id of the pod node. -1 if the node is not part of a configured pod
 * @param mode[out] - current operating mode
 * @param modes_supported[out] - supported operating modes
 */
int ndl_pod_status(uint8_t *pod_id, uint32_t *state, uint32_t *pod_type, uint32_t *pod_sz, int *node_id, enum neuron_ultraserver_mode *mode, uint32_t *modes_supported);

/** control pod election state
 *
 * @param ctrl[in] - control request (enum neuron_pod_ctrl_req)
 * @param mode[in] - requested operating mode
 * @param timeout[in] - timeout for the control operation
 * @param state[out] - state of the pod election
 */
int ndl_pod_ctrl(uint32_t ctrl, enum neuron_ultraserver_mode mode, uint32_t timeout, uint32_t *state);

int ndl_alloc_contiguous_scratchpad(ndl_device_t *device, uint64_t size, uint32_t hbm_index, uint32_t nc_id, uint64_t *mem_handle);
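/*
 * Illustrative pairing (not part of this header): allocate a contiguous
 * scratchpad, then map the full span through the returned handle via
 * ndl_memory_map_contiguous_scratchpad() declared below; size, hbm and nc
 * are placeholders supplied by the caller:
 *
 *   uint64_t mh;
 *   void *va;
 *   if (ndl_alloc_contiguous_scratchpad(dev, size, hbm, nc, &mh) == 0)
 *       ndl_memory_map_contiguous_scratchpad(mh, &va, size);
 */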
/** Similar to ndl_memory_map; the only difference is that a contiguous
 * scratchpad var may span multiple contiguous memchunks, so the size of the
 * memory mapping can differ from the size of just the first contiguous
 * memchunk.
 *
 * @param mem_handle[in] - Handle to map.
 * @param va[out] - Resulting virtual address.
 * @param size[in] - Size to map
 *
 * @return 0 on success
 */
int ndl_memory_map_contiguous_scratchpad(uint64_t mem_handle, void **va, uint64_t size);

/** Set performance profile
 *
 * @param device[in] - Device handle.
 * @param profile[in] - Performance profile to set.
 *
 * @return 0 on success.
 */
int ndl_set_performance_profile(ndl_device_t *device, uint32_t profile);

/** Enable or disable throttling notifications
 *
 * @param device[in] - Device handle.
 * @param enable[in] - true to enable, false to disable.
 *
 * @return 0 on success.
 */
int ndl_enable_throttling_notifications(ndl_device_t *device, bool enable);

bool ndl_feature_supported(int nd_fd, uint64_t feature);

/** dynamically allocate h2t queues (rings)
 *
 * @param device[in] - Neuron device
 * @param nc_id[in] - neuron core to allocate h2t queues for
 * @param copy_queue_cnt[in] - number of h2t copy queues to allocate
 * @param service_queue_cnt[in] - number of service queues to allocate
 * @param copy_queue_bmap[out] - bitmap of the allocated copy queues
 * @param service_queue_bmap[out] - bitmap of the allocated service queues
 * @param copy_default_queue[out] - default h2t copy queue
 */
int ndl_h2t_dma_queue_alloc(ndl_device_t *device, uint32_t nc_id, uint32_t copy_queue_cnt, uint32_t service_queue_cnt, uint32_t *copy_queue_bmap, uint32_t *service_queue_bmap, uint32_t *copy_default_queue);

/** free dynamically allocated h2t queues
 *
 * @param device[in] - Neuron device
 * @param nc_id[in] - neuron core to free queues for
 * @param queue_bmap[in] - bitmap of queues to free
 */
int ndl_h2t_dma_queue_free(ndl_device_t *device, uint32_t nc_id, uint32_t queue_bmap);

/** control metrics posting behavior
 *
 * @param device[in] - Neuron device
 * @param mode[in] - how to modify posting behavior (enable or disable periodic posting)
 */
int ndl_metrics_ctrl(ndl_device_t *device, enum neuron_metrics_mode mode);

/** get the Neuron device and HBM index pointed to by a VA
 *
 * @param va[in] - VA of Neuron memory
 * @param device_index[out] - Neuron device
 * @param hbm_index[out] - HBM index
 */
int ndl_get_va_placement(const void *va, int *device_index, int *hbm_index);

/**
 * arbitrary size bitmap support
 */
#define NBM_NR_BITS(t) (sizeof(t)*8)
#define NBM_NR_ENT(nr,t) (((nr)+NBM_NR_BITS(t)-1) / NBM_NR_BITS(t))

static inline uint32_t nbitmap_test_bit(uint32_t nr, uint64_t *addr)
{
	return (addr[nr/NBM_NR_BITS(*addr)] & (1ull << (nr % NBM_NR_BITS(*addr)))) != 0ull;
}

static inline void nbitmap_set_bit(uint32_t nr, uint64_t *addr)
{
	addr[nr/NBM_NR_BITS(*addr)] |= (1ull << (nr % NBM_NR_BITS(*addr)));
}

static inline uint32_t nbitmap_ffs1(uint32_t nr, uint64_t *addr)
{
	int i;
	for (i=0; i < NBM_NR_ENT(nr, *addr); i++) {
		uint32_t x = __builtin_ffsl(addr[i]);
		if (x)
			return i * NBM_NR_BITS(*addr) + x;
	}
	return 0;
}

static inline uint32_t nbitmap_popcount(uint32_t nr, uint64_t *addr)
{
	int i;
	uint32_t cnt = 0;
	for (i=0; i < NBM_NR_ENT(nr, *addr); i++) {
		cnt += __builtin_popcountll(addr[i]);
	}
	return cnt;
}

static inline void nbitmap_clr_bit(uint32_t nr, uint64_t *addr)
{
	addr[nr/NBM_NR_BITS(*addr)] &= ~(1ull << (nr % NBM_NR_BITS(*addr)));
}
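/*
 * Sketch of the intended use (illustrative only): callers size the bitmap
 * with NBM_NR_ENT and pass it to the helpers above, e.g. for the marked-core
 * bitmaps used by ndl_crwl_nc_range_mark()/ndl_crwl_nc_range_unmark():
 *
 *   uint64_t bm[NBM_NR_ENT(128, uint64_t)] = {0};  // room for 128 bits
 *   nbitmap_set_bit(5, bm);
 *   if (nbitmap_test_bit(5, bm)) {
 *       // bit 5 is set; total population is nbitmap_popcount(128, bm) == 1
 *   }
 *   nbitmap_clr_bit(5, bm);
 */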
#ifdef __cplusplus
}
#endif

================================================
FILE: src/libnrt/include/ndl/neuron_driver_shared.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */
#ifndef NEURON_DRIVER_SHARED_H
#define NEURON_DRIVER_SHARED_H

#include <linux/types.h>
#include "neuron_driver_shared_tensor_batch_op.h"

enum neuron_driver_feature_flag {
	NEURON_DRIVER_FEATURE_DMABUF = 1ull << 0,
	NEURON_DRIVER_FEATURE_ASYNC_DMA = 1ull << 1,
	NEURON_DRIVER_FEATURE_BATCH_DMAQ_INIT = 1ull << 2,
	NEURON_DRIVER_FEATURE_BIG_CORE_MAPS = 1ull << 3,
	NEURON_DRIVER_FEATURE_MEM_ALLOC_TYPE = 1ull << 4,
	NEURON_DRIVER_FEATURE_HBM_SCRUB = 1ull << 5,
	NEURON_DRIVER_FEATURE_MEM_ALLOC64 = 1ull << 6,
	NEURON_DRIVER_FEATURE_CONTIGUOUS_SCRATCHPAD = 1ull << 7,
	NEURON_DRIVER_FEATURE_ZEROCOPY = 1ull << 8,
};

// FIXME this should be more generic - like node type.
enum { NEURON_POD_TYPE_NONE = 0, NEURON_POD_TYPE_P2P, NEURON_POD_TYPE_SWITCH };

enum {
	NEURON_POD_E_STATE_NOT_STARTED = 0,
	NEURON_POD_E_STATE_IN_PROGRESS,
	NEURON_POD_E_STATE_ULTRASERVER,
	NEURON_POD_E_STATE_FAILED,
	// TODO we currently don't discriminate between failed and single node (todo for diagnostic/debug purposes)
	NEURON_POD_E_STATE_SINGLE_NODE,
};

enum neuron_pod_ctrl_req {
	NEURON_NPE_POD_CTRL_REQ_POD = 0,         // request pod state to pod (on-demand election request)
	NEURON_NPE_POD_CTRL_REQ_SINGLE_NODE = 1, // request pod state to single node
	NEURON_NPE_POD_CTRL_REQ_KILL = 2,        // request to kill the election
	NEURON_NPE_POD_CTRL_SET_MODE = 3,        // request to set the ultraserver mode
};

enum neuron_ultraserver_mode {
	NEURON_ULTRASERVER_MODE_UNSET = 0, // no configuration set
	NEURON_ULTRASERVER_MODE_X4 = 1,    // 4 node US configuration
	NEURON_ULTRASERVER_MODE_X2H = 2,   // 2 node US configuration using horizontal links
	NEURON_ULTRASERVER_MODE_X2V = 3,   // 2 node US configuration using vertical links
	NEURON_ULTRASERVER_MODE_X1 = 4,    // 1 node US configuration (standalone)
};

enum neuron_metrics_mode {
	NEURON_METRICS_MODE_PERIODIC_ENABLE = 0,  // enable periodic posting
	NEURON_METRICS_MODE_PERIODIC_DISABLE = 1, // disable periodic posting
};

#define NEURON_NC_MAP_DEVICE (0xffffffff)

enum neuron_dma_queue_type {
	NEURON_DMA_QUEUE_TYPE_TX = 0,     // transmit queue
	NEURON_DMA_QUEUE_TYPE_RX,         // receive queue
	NEURON_DMA_QUEUE_TYPE_COMPLETION, // completion queue
};

enum neuron_cinit_state {
	NEURON_CINIT_STATE_STARTED = 1, // Core Init is initiated
	NEURON_CINIT_STATE_COMPLETED,   // Core Init is completed successfully
	NEURON_CINIT_STATE_INVALID      // Core Init is not valid
};

struct neuron_dma_eng_state {
	__u32 revision_id; // revision id
	__u32 max_queues;  // maximum queues supported
	__u32 num_queues;  // number of queues configured
	__u32 tx_state;    // Tx state
	__u32 rx_state;    // Rx state
};

struct neuron_dma_queue_state {
	__u32 hw_status;            // hardware status
	__u32 sw_status;            // software status
	__u64 base_addr;            // base address of the queue
	__u32 length;               // size of the queue
	__u32 head_pointer;         // hardware pointer index
	__u32 tail_pointer;         // software pointer index
	__u64 completion_base_addr; // completion queue base address
	__u32 completion_head;      // completion head
};

enum neuron_dma_h2t_ctx_handle_type {
	NEURON_DMA_H2T_CTX_HANDLE_NONE = -1,  // no handle - used as prev handle to start an async dma
	NEURON_DMA_H2T_CTX_HANDLE_SYNC = 0,   // handle for doing synchronous DMA
	NEURON_DMA_H2T_CTX_HANDLE_ASYNC1 = 1, // first of two async handles
	NEURON_DMA_H2T_CTX_HANDLE_ASYNC2 = 2, // second of two async handles
	NEURON_DMA_H2T_CTX_HANDLE_CNT = 3     // number of dma handles
};

/*
 * H2T DMA Default Queue id
 */
#define NEURON_DMA_H2T_DEFAULT_QID (-1)
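/*
 * Illustrative check (not part of this header; assumes nd_fd is the open
 * device fd): the neuron_driver_feature_flag values above are single bits,
 * so a caller can gate optional paths with ndl_feature_supported() from
 * ndl.h, e.g.:
 *
 *   if (ndl_feature_supported(nd_fd, NEURON_DRIVER_FEATURE_DMABUF)) {
 *       // dma-buf registration paths such as ndl_get_dmabuf_fd() are usable
 *   }
 */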
/*
 * NOTE: In runtime version 5, this enum was passed in as a bool instead -
 * true if top_sp and false if NC. Match the enum values to the bool to
 * maintain compatibility with older runtimes. Do not change these values
 * until the min compatibility version is updated to >=6.
 */
enum NQ_DEVICE_TYPE { NQ_DEVICE_TYPE_NEURON_CORE = 0, NQ_DEVICE_TYPE_TOPSP, NQ_DEVICE_TYPE_MAX };

enum NQ_TYPE {
	NQ_TYPE_TRACE = 0,  /**< Implicit notifications generated during execution. */
	NQ_TYPE_NOTIFY,     /**< Explicit notifications generated by NOTIFY instruction */
	NQ_TYPE_EVENT,      /**< Notifications triggered by event set/clear operations. */
	NQ_TYPE_ERROR,      /**< Notifications triggered by an error condition. */
	NQ_TYPE_TRACE_DMA,  /**< Implicit notifications generated by DMA transfers.*/
	NQ_TYPE_THROTTLE,   /**< Notifications triggered by throttling activity. */
	NQ_TYPE_MAX
};

/**
 * memory mapping enums for selecting which bar0 resources to map.
 * Bar0 mmap'ing is restricted to a limited set of regions.
 * Resources are selected by block type, block id and resource within the block.
 * TPB 1 State buffer, for example - where type is TPB, block id is 1 and
 * resource is state buffer.
 * NEURON_DM_RESOURCE_ALL resource mapping is restricted to read only.
 */
enum neuron_dm_block_type {
	NEURON_DM_BLOCK_INVALID = -1, // invalid - tags last entry in the table
	NEURON_DM_BLOCK_TPB = 0,
	NEURON_DM_BLOCK_TOPSP = 1,
	NEURON_DM_BLOCK_HBM = 2
};

enum neuron_dm_resource_type {
	NEURON_DM_RESOURCE_SEMAPHORE = 0, // resource to mmap is the semaphore region
	NEURON_DM_RESOURCE_ALL = 1,       // resource to mmap is the entire block (read only). Only available for TOPSP
	NEURON_DM_RESOURCE_SBUF = 2,      // resource to mmap is the state buffer
	NEURON_DM_RESOURCE_DMEM = 3       // resource to mmap is device memory
};

struct neuron_uuid {
	__u8 value[32];
};

#define NEURON_MAX_PROCESS_PER_DEVICE 16 // 2 per core (arbitrary, but needs to be a small number for fast lookup)
#define APP_INFO_PID_NC_LOCK_INFO (1)
#define APP_INFO_PID_MEM_USAGE (1 << 1)
#define APP_INFO_ALL (0xF)
#define APP_INFO_MAX_MODELS_PER_DEVICE (4)
#define NDS_INVALID_ID (-1)

struct neuron_app_info {
	__s32 pid;        // PID of this app
	__u8 nc_lock_map; // NCs which are locked by it (one bit set for each locked NC)
	struct neuron_uuid uuid_data[APP_INFO_MAX_MODELS_PER_DEVICE]; // UUIDs running for this app for each neuroncore
	size_t host_mem_size;   // Amount of host memory used by this PID
	size_t device_mem_size; // Amount of device memory used by this PID
};

typedef union nmetric_version {
	struct {
		__u64 build_num : 32;
		__u64 minor_ver : 8;
		__u64 major_ver : 8;
		__u64 reserved : 16;
	};
	__u64 all;
} nmetric_version_t;

struct neuron_ioctl_mem_chunk_info {
	__u64 pa;
	__u64 size;
	__u32 mem_type;
};

// Max number of entries this version of the driver
// will ever give back to the user
#define NEURON_NC_MAP_MAX_ENTRIES 128

enum neuron_ioctl_nc_mapping_type {
	NEURON_IOCTL_NC_MAPPING_TYPE_V0 = 0, // seng swap mapping
};

struct neuron_ioctl_nc_map_entry {
	__u32 device_id;
	__u32 device_nc_idx;
};

struct neuron_ioctl_nc_map {
	__u32 num_entries;
	struct neuron_ioctl_nc_map_entry mappings[];
};

/* A batch of copy operations */
typedef struct neuron_memcpy_batch {
	__u64 mem_handle;        // [in] Source or destination memory handle from/to which data needs to be copied.
	__u64 mem_handle_offset; // [in] Memory offset of the memory handle
	const nrt_tensor_batch_op_t *ops_ptr; // [in] Pointer to array of operations
	__u32 num_ops;           // [in] Number of neuron_memcpy_op operations.
	__u16 bar4_wr_threshold; // [in] Threshold below which we will use bar4 direct write vs. DMA. Subject to driver limits.
	__u16 flags;             // [in] TBD.
	void *context;           // [in] TBD - opaque context pointer passed back in the completion queue
} neuron_memcpy_batch_t;

/*
 * Memory allocation categories for sysfs counters
 */
typedef enum {
	NEURON_MEMALLOC_TYPE_UNKNOWN_HOST, // only for old runtimes, do not use elsewhere
	NEURON_MEMALLOC_TYPE_CODE_HOST,
	NEURON_MEMALLOC_TYPE_TENSORS_HOST,
	NEURON_MEMALLOC_TYPE_CONSTANTS_HOST,
	NEURON_MEMALLOC_TYPE_MISC_HOST,
	NEURON_MEMALLOC_TYPE_NCDEV_HOST,
	NEURON_MEMALLOC_TYPE_NOTIFICATION_HOST,
	NEURON_MEMALLOC_TYPE_UNKNOWN_DEVICE, // only for old runtimes, do not use elsewhere
	NEURON_MEMALLOC_TYPE_CODE_DEVICE,
	NEURON_MEMALLOC_TYPE_TENSORS_DEVICE,
	NEURON_MEMALLOC_TYPE_CONSTANTS_DEVICE,
	NEURON_MEMALLOC_TYPE_SCRATCHPAD_DEVICE,
	NEURON_MEMALLOC_TYPE_MISC_DEVICE,
	NEURON_MEMALLOC_TYPE_NCDEV_DEVICE,
	NEURON_MEMALLOC_TYPE_COLLECTIVES_DEVICE,
	NEURON_MEMALLOC_TYPE_SCRATCHPAD_NONSHARED_DEVICE,
	NEURON_MEMALLOC_TYPE_NOTIFICATION_DEVICE,
	NEURON_MEMALLOC_TYPE_DMA_RINGS_HOST,
	NEURON_MEMALLOC_TYPE_DMA_RINGS_DEVICE,
	NEURON_MEMALLOC_TYPE_CONTIGUOUS_SCRATCHPAD_DEVICE, // uses same sysfs counter as NEURON_MEMALLOC_TYPE_SCRATCHPAD_DEVICE
	NEURON_MEMALLOC_TYPE_MAX
} mem_alloc_category_t;

/*
 * NDS stats
 * Note:
 * To add a new counter type inside the enum,
 * 1. you need to manually decrease NDS_ND_COUNTER_RESERVED or NDS_NC_COUNTER_RESERVED by 1
 * 2. you need to update NDS_ND_COUNTER_COUNT or NDS_NC_COUNTER_COUNT
 * To prevent compatibility issues, always append the new counter type to the end of the enum
 */
#define NDS_ND_COUNTER_RESERVED 18
// Device counter types
enum {
	NDS_ND_COUNTER_RUNTIME_VERSION,
	NDS_ND_COUNTER_FRAMEWORK_VERSION,
	NDS_ND_COUNTER_FAL_VERSION,
	NDS_ND_COUNTER_FEATURE_BITMAP,
	NDS_ND_COUNTER_MIN_NEFF_VERSION,
	NDS_ND_COUNTER_MAX_NEFF_VERSION,
	// memory usage counters
	NDS_ND_COUNTER_MEM_USAGE_CODE_HOST,
	NDS_ND_COUNTER_MEM_USAGE_TENSORS_HOST,
	NDS_ND_COUNTER_MEM_USAGE_CONSTANTS_HOST,
	NDS_ND_COUNTER_MEM_USAGE_SCRATCHPAD_HOST,
	NDS_ND_COUNTER_MEM_USAGE_MISC_HOST,
	NDS_ND_COUNTER_DYNAMIC_SYSFS_METRIC_BITMAP,
	NDS_ND_COUNTER_DEVICE_CLUSTER_ID,
	NDS_ND_COUNTER_COUNT = NDS_ND_COUNTER_DEVICE_CLUSTER_ID + NDS_ND_COUNTER_RESERVED + 1
};

#define NDS_NC_COUNTER_RESERVED 0
// Neuroncore counter types
enum {
	NDS_NC_COUNTER_TIME_IN_USE = 0,
	NDS_NC_COUNTER_INFER_COMPLETED,
	NDS_NC_COUNTER_INFER_COMPLETED_WITH_ERR,
	NDS_NC_COUNTER_INFER_COMPLETED_WITH_NUM_ERR,
	NDS_NC_COUNTER_INFER_TIMED_OUT,
	NDS_NC_COUNTER_INFER_INCORRECT_INPUT,
	NDS_NC_COUNTER_INFER_FAILED_TO_QUEUE,
	// these must be in this specific order -
	// runtime assumes these are offset by
	// error code
	NDS_NC_COUNTER_ERR_GENERIC,
	NDS_NC_COUNTER_ERR_NUMERICAL,
	NDS_NC_COUNTER_ERR_MODEL,
	NDS_NC_COUNTER_ERR_TRANSIENT,
	NDS_NC_COUNTER_ERR_HW,
	NDS_NC_COUNTER_ERR_RT,
	NDS_NC_COUNTER_LATENCY_DEVICE,
	NDS_NC_COUNTER_LATENCY_TOTAL,
	NDS_NC_COUNTER_NC_TIME,
	// these are new counters;
	// they shall be placed at the
	// end so their offsets are always
	// greater than the old counters.
	// This will ensure
	// new runtime + old driver will
	// write to reserved sections and not
	// break anything
	NDS_NC_COUNTER_GENERIC_FAIL,
	NDS_NC_COUNTER_ERR_RESOURCE,
	NDS_NC_COUNTER_ERR_RESOURCE_NC,
	NDS_NC_COUNTER_ERR_INVALID,
	NDS_NC_COUNTER_ERR_UNSUPPORTED_NEFF_VERSION,
	NDS_NC_COUNTER_CC_TIME,
	NDS_NC_COUNTER_MEM_USAGE_CODE_DEVICE,
	NDS_NC_COUNTER_MEM_USAGE_TENSORS_DEVICE,
	NDS_NC_COUNTER_MEM_USAGE_CONSTANTS_DEVICE,
	NDS_NC_COUNTER_MEM_USAGE_SCRATCHPAD_DEVICE,
	NDS_NC_COUNTER_MEM_USAGE_MISC_DEVICE,
	NDS_NC_COUNTER_MODEL_LOAD_COUNT,
	NDS_NC_COUNTER_INFERENCE_COUNT,
	NDS_NC_COUNTER_MAC_COUNT,
	NDS_NC_COUNTER_OOB,
	NDS_NC_COUNTER_COUNT = NDS_NC_COUNTER_OOB + NDS_NC_COUNTER_RESERVED + 1
};

#define NDS_MAX_NEURONCORE_COUNT (4)
#define NDS_EXT_MAX_NEURONCORE_COUNT (12)

// Additional NC storage
// | NDS_EXT_NC_COUNTER_COUNT | ... | NDS_EXT_NC_COUNTER_COUNT | (x NDS_MAX_NEURONCORE_COUNT) - this will only store the 'overflow' from the original counters
// | NDS_NC_COUNTER_COUNT + NDS_EXT_NC_COUNTER_COUNT | ... (x NDS_EXT_MAX_NEURONCORE_COUNT) - this will store complete data for additional NCs (up to a max of 16)
#define NDS_EXT_NC_COUNTER_ADDED_RESERVED 54

// Indexes of the NC counter extensions start at NDS_NC_COUNTER_COUNT, not at 0
enum {
	NDS_EXT_NC_COUNTER_HW_ERR_COLLECTIVES = NDS_NC_COUNTER_COUNT,
	NDS_EXT_NC_COUNTER_HW_ERR_HBM_UE,
	NDS_EXT_NC_COUNTER_HW_ERR_NC_UE,
	NDS_EXT_NC_COUNTER_HW_ERR_DMA_ABORT,
	NDS_EXT_NC_COUNTER_ERR_SW_NQ_OVERFLOW,
	NDS_EXT_NC_COUNTER_ERR_SW_SEMAPHORE_ERROR,
	NDS_EXT_NC_COUNTER_ERR_SW_EVENT_ERROR,
	NDS_EXT_NC_COUNTER_ERR_SW_PSUM_COLLISION,
	NDS_EXT_NC_COUNTER_ERR_SW_SEQUENCER_FATAL,
	NDS_EXT_NC_COUNTER_HW_ERR_REPAIRABLE_HBM_UE,
	NDS_EXT_NC_COUNTER_LAST,
	NDS_EXT_NC_COUNTER_COUNT = NDS_EXT_NC_COUNTER_LAST - NDS_NC_COUNTER_COUNT + NDS_EXT_NC_COUNTER_ADDED_RESERVED
};

#define NDS_TOTAL_NC_COUNTER_COUNT (NDS_NC_COUNTER_COUNT + NDS_EXT_NC_COUNTER_COUNT) // 31 original + 64 extended = 95 counters

typedef struct nds_header {
	char signature[4]; // Fixed signature: 'n', 'd', 's', 0
	int version;       // Version of the datastore's format
} nds_header_t;

/* --------------------------------------------
 * NDS shared data offsets
 * --------------------------------------------
 */
#define NDS_HEADER_START (0)
#define NDS_HEADER_SIZE (sizeof(nds_header_t))
#define NDS_ND_COUNTERS_START (NDS_HEADER_START + NDS_HEADER_SIZE)
#define NDS_ND_COUNTERS_SIZE (NDS_ND_COUNTER_COUNT * sizeof(uint64_t))
#define NDS_ND_COUNTERS(base_addr) ((uint64_t *)(base_addr + NDS_ND_COUNTERS_START))

// original NC counter section
#define NDS_NEURONCORE_COUNTERS_COUNT (NDS_NC_COUNTER_COUNT)
#define NDS_NEURONCORE_COUNTERS_START (NDS_ND_COUNTERS_START + NDS_ND_COUNTERS_SIZE)
#define NDS_NEURONCORE_COUNTERS_SIZE (NDS_NEURONCORE_COUNTERS_COUNT * NDS_MAX_NEURONCORE_COUNT * sizeof(uint64_t))
#define NDS_NEURONCORE_COUNTERS(base_addr, nc_index) ((uint64_t *)(base_addr + NDS_NEURONCORE_COUNTERS_START) + (nc_index * NDS_NEURONCORE_COUNTERS_COUNT))
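/*
 * Layout sketch (illustrative only): given the mmap'd datastore base address
 * (a char* here, an assumption about the caller's mapping), the offset macros
 * above index per-core counter slots directly:
 *
 *   char *base = ...;  // mapped NDS region
 *   uint64_t *nc0 = NDS_NEURONCORE_COUNTERS(base, 0);
 *   uint64_t infers_done = nc0[NDS_NC_COUNTER_INFER_COMPLETED];
 */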
// additional NC counter section at the end of all existing structures in the datastore (i.e. after NDS_PROCESS_EXT_INFO)
// NDS_PROCESS_EXT_INFO_START + NDS_PROCESS_EXT_INFO_SIZE = 44588 (hardcoded because it's easier than moving all the structs here and taking sizeof of them)
#define NDS_EXT_NC_COUNTER_COUNT_OLD (65)
#define NDS_TOTAL_NC_COUNTER_COUNT_OLD (96)
#define NDS_EXT_NEURONCORE_COUNTERS_SIZE_OLD (NDS_EXT_NC_COUNTER_COUNT_OLD * NDS_MAX_NEURONCORE_COUNT * sizeof(uint64_t))
#define NDS_EXT_NEURONCORE_NC_DATA_SIZE_OLD (NDS_TOTAL_NC_COUNTER_COUNT_OLD * NDS_EXT_MAX_NEURONCORE_COUNT * sizeof(uint64_t))
#define NDS_EXT_SECTION_SIZE_OLD (NDS_EXT_NEURONCORE_COUNTERS_SIZE_OLD + NDS_EXT_NEURONCORE_NC_DATA_SIZE_OLD)
#define NDS_EXT_OFFSET_OLD (44588)
#define NDS_EXT_ALIGNMENT (64)
#define NDS_ALIGN(v) ((v) + (-(v) & (NDS_EXT_ALIGNMENT - 1)))
#define NDS_EXT_OFFSET (NDS_ALIGN(NDS_EXT_OFFSET_OLD + NDS_EXT_SECTION_SIZE_OLD))
#define NDS_EXT_NEURONCORE_COUNTERS_COUNT (NDS_EXT_NC_COUNTER_COUNT) // number of extended counters
#define NDS_EXT_NEURONCORE_COUNTERS_START (NDS_EXT_OFFSET)
#define NDS_EXT_NEURONCORE_COUNTERS_SIZE (NDS_EXT_NC_COUNTER_COUNT * NDS_MAX_NEURONCORE_COUNT * sizeof(uint64_t))
#define NDS_EXT_NEURONCORE_COUNTERS(base_addr, nc_index) ((uint64_t *)(base_addr + NDS_EXT_NEURONCORE_COUNTERS_START) + (nc_index * NDS_EXT_NC_COUNTER_COUNT))

// additional NC data for extra Neuron Cores (12 extra sets which include all 95 counters + 1 for padding)
#define NDS_EXT_NEURONCORE_NC_DATA_PADDING (1) // 1 added as padding for 64 byte alignment per NC
#define NDS_EXT_NEURONCORE_NC_DATA_COUNT (NDS_TOTAL_NC_COUNTER_COUNT + NDS_EXT_NEURONCORE_NC_DATA_PADDING) // full set of counters (base + extended) + padding
#define NDS_EXT_NEURONCORE_NC_DATA_START (NDS_ALIGN(NDS_EXT_NEURONCORE_COUNTERS_START + NDS_EXT_NEURONCORE_COUNTERS_SIZE))
#define NDS_EXT_NEURONCORE_NC_DATA_SIZE (NDS_EXT_MAX_NEURONCORE_COUNT * NDS_EXT_NEURONCORE_NC_DATA_COUNT * sizeof(uint64_t))
#define NDS_EXT_NEURONCORE_NC_DATA(base_addr, nc_index) ((uint64_t *)(base_addr + NDS_EXT_NEURONCORE_NC_DATA_START) + (nc_index * NDS_EXT_NEURONCORE_NC_DATA_COUNT))

#endif // NEURON_DRIVER_SHARED_H

================================================
FILE: src/libnrt/include/ndl/neuron_driver_shared_tensor_batch_op.h
================================================
/*
 * Shared tensor batch operation between runtime and driver.
 */
#ifndef NEURON_DRIVER_SHARED_TENSOR_BATCH_OP_H
#define NEURON_DRIVER_SHARED_TENSOR_BATCH_OP_H

#ifdef __KERNEL__
#include <linux/types.h>
typedef __u64 nrt_tensor_batch_offset_t;
typedef __u64 nrt_tensor_batch_size_t;
#else
#include <stdint.h>
typedef uint64_t nrt_tensor_batch_offset_t;
typedef uint64_t nrt_tensor_batch_size_t;
#endif

typedef struct nrt_tensor_batch_op {
	nrt_tensor_batch_offset_t offset;
	nrt_tensor_batch_size_t size;
	void *buffer;
} nrt_tensor_batch_op_t;

#endif // NEURON_DRIVER_SHARED_TENSOR_BATCH_OP_H

================================================
FILE: src/libnrt/include/nrt/ndebug_stream.h
================================================
/*
 * Copyright 2025, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */
/**
 * Overview:
 * The `ndebug_stream` APIs provide applications a way to consume debug events from the runtime (see
 * `ndebug_stream_event_type_t` for the different event types). These debug events are emitted by the
 * runtime per Logical Neuron Core and can be used by applications to get information on events that
 * occurred on the device (i.e. prints, breakpoints, etc.).
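 *
 * A minimal consumption loop might look like the sketch below (illustrative
 * only; assumes POSIX poll() and that the payload is released with free()):
 *
 *   int fd;
 *   if (nrt_debug_client_connect(0, &fd) == NRT_SUCCESS) {
 *       struct pollfd p = { .fd = fd, .events = POLLIN };
 *       while (poll(&p, 1, -1) > 0) {
 *           ndebug_stream_event_header_t hdr;
 *           void *payload;
 *           if (nrt_debug_client_read_one_event(fd, &hdr, &payload) != NRT_SUCCESS)
 *               break;
 *           // ... dispatch on hdr.type ...
 *           free(payload);  // caller owns the payload
 *       }
 *       nrt_debug_client_connect_close(fd);
 *   }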
 *
 * Connecting, polling, and consuming:
 * Applications that want to consume debug events first need to connect to a Logical Neuron Core's debug stream via a call to
 * `nrt_debug_client_connect`. Once a client is connected to a core's debug stream, the runtime will push debug events emitted
 * by the Logical Neuron Core to the stream for clients to consume. To be notified of emitted debug events, clients can utilize the
 * polling APIs provided by the Linux kernel. The `stream_fd` handle obtained from `nrt_debug_client_connect` is a typical Linux
 * file descriptor and can be passed into any Linux polling API. It is important to note, though, that while the `stream_fd` is pollable,
 * all other non-polling-related functionality must go through the provided `nrt_debug_client*` APIs. For example, the stream contents
 * can only be accessed from the `nrt_debug_client_read*` API(s); any other method of accessing the stream data leads to
 * undefined/undesirable behavior.
 *
 * Closing a Connection:
 * Once a connection is no longer needed, clients can close the connection using the `nrt_debug_client_connect_close` API.
 *
 * Events:
 * Events consist of a header describing the payload type, and a payload representing the contents of the event. Events can be consumed by
 * clients via the `nrt_debug_client_read*` API(s).
 *
 * Notes:
 * * These APIs do not allow for interprocess communication. Debug events are only pushed to the process that owns the Logical Neuron Core.
 * * These APIs do not provide thread safety for multiple threads accessing the SAME stream (thread safety for different streams is guaranteed).
 * * There can only be one outstanding connection per stream. Any attempt to initialize multiple connections will result in an error.
 * * Events are only emitted AFTER a client connects to a Logical Neuron Core's stream. Any event that would have been emitted before connecting
 *   to the stream is dropped.
 * * Events will be dropped if the number of unconsumed events in a stream exceeds the stream's buffer size. Clients must consume events fast
 *   enough to prevent dropped events. Additionally, clients can configure the stream's buffer size via the `NEURON_RT_DEBUG_STREAM_BUFFER_SIZE`
 *   environment variable. The buffer size currently defaults to 64K debug events.
 */
#pragma once

#include <stdint.h>
#include "nrt/nrt_status.h"

#ifdef __cplusplus
extern "C" {
#endif

typedef enum ndebug_stream_event_type {
	NDEBUG_STREAM_EVENT_TYPE_INVALID = 0,
	NDEBUG_STREAM_EVENT_TYPE_DEBUG_TENSOR_READ = 1,
} ndebug_stream_event_type_t;

typedef struct ndebug_stream_event_header {
	uint64_t data_size;
	uint32_t type;
	char reserved[52];
} ndebug_stream_event_header_t;

typedef struct ndebug_stream_payload_debug_tensor_read {
	char prefix[512];
	uint32_t logical_nc_id;
	uint32_t pipe;
	char tensor_dtype[16];
	uint64_t tensor_shape[8];
	uint64_t tensor_data_size;
	char reserved0[416];
	char tensor_data[];
} ndebug_stream_payload_debug_tensor_read_t;

/** Establish a connection to a specified Logical Neuron Core's debug stream.
 *
 * @param logical_nc_idx[in] - Core's debug stream to connect to.
 * @param stream_fd[out] - Connection handle used to reference and interact with the stream.
 *
 * @return NRT_SUCCESS on success.
 *
 * @note Only one client can connect to a Logical Neuron Core's stream at any given time.
 * Attempts to connect to a stream with multiple clients will result in a NRT_INVALID
 * return status.
 */
NRT_STATUS nrt_debug_client_connect(int logical_nc_idx, int *stream_fd);

/** Closes a connection created by `nrt_debug_client_connect`
 *
 * @param stream_fd[in] - Connection handle to close.
 */
void nrt_debug_client_connect_close(int stream_fd);

/** Consumes a single event from the stream.
 *
 * @param stream_fd[in] - Stream to consume an event from
 * @param header[out] - Consumed event's header. See `ndebug_stream_event_header_t`.
 * @param payload[out] - Consumed event's payload. See `ndebug_stream_payload*` and `ndebug_stream_event_type_t`.
 *        **IMPORTANT**: it is the user's responsibility to free this payload pointer.
 *
 * @return NRT_SUCCESS on success.
 *
 * @note This function must be called from the same process that owns the Logical Neuron Core. Calling this
 * function from any other process results in undefined behavior.
 */
NRT_STATUS nrt_debug_client_read_one_event(int stream_fd, ndebug_stream_event_header_t *header, void **payload);

#ifdef __cplusplus
}
#endif

================================================
FILE: src/libnrt/include/nrt/nds/neuron_ds.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */
#pragma once

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#ifdef __cplusplus
extern "C" {
#endif

// Main NDS object handle
typedef void *nds_obj_handle_t;

// NDS object types
#define OBJECT_TYPE_MODEL_NODE_INFO (0)
#define OBJECT_TYPE_PROCESS_INFO (1)
#define OBJECT_TYPE_PROCESS_INFO_EXT (2)

// Model-related structs
#define MODEL_MEM_USAGE_LOCATION_COUNT 2
/*
 * Number of slots for mem_usage_type in Neuron Datastore (also used by tools)
 *
 * In the current version of the neuron datastore's format, there are only 12 slots for storing
 * memory usage type, so we aggregate them using the same logic as for the 'per NC' memory tracker.
 * Monitor has always aggregated them even further by adding them together, so we aren't breaking any feature.
 *
 * For usage type definitions, go to "inc/tdrv/dma_mem_usage_type.h"
 */
enum {
	NDS_DMA_MEM_USAGE_SLOT_CODE,
	NDS_DMA_MEM_USAGE_SLOT_TENSORS,
	NDS_DMA_MEM_USAGE_SLOT_CONSTANTS,
	NDS_DMA_MEM_USAGE_SLOT_SCRATCHPAD,
	NDS_DMA_MEM_USAGE_SLOT_MISC,
	NDS_DMA_MEM_USAGE_SLOT_COUNT = 12 // do not change
};

// Aggregated data for all chunks of the same type/location
typedef struct nds_mem_usage_info {
	size_t total_size;    // Total size
	uint32_t chunk_count; // Number of chunks that make up the total size
} nds_mem_usage_info_t;

// Loaded model node information
typedef struct nds_model_node_info {
	uint32_t model_id;      // parent model id
	uint32_t model_node_id; // node id
	char name[256];         // model name
	char uuid[16];          // uuid
	uint8_t nc_index;       // nc index
	uint8_t sg_index;       // subgraph index
} nds_model_node_info_t;

// Loaded model node memory usage information
typedef struct nds_model_node_mem_usage_info {
	// MODEL_MEM_USAGE_LOCATION_COUNT per each usage type
	nds_mem_usage_info_t model_mem_usage[MODEL_MEM_USAGE_LOCATION_COUNT][NDS_DMA_MEM_USAGE_SLOT_COUNT];
} nds_model_node_mem_usage_info_t;

// Version information
typedef struct nds_version_info {
	uint8_t major;
	uint8_t minor;
	uint32_t build;
} nds_version_info_t;

// Process information-related struct
typedef struct nds_process_info {
	int8_t framework_type;
	char tag[32];
	nds_version_info_t framework_version;
	nds_version_info_t fal_version;
	nds_version_info_t runtime_version;
} nds_process_info_t;

// Extended process information
typedef struct nds_process_info_ext {
	char tag[256];
} nds_process_info_ext_t;

typedef struct nds_instance nds_instance_t;
typedef struct ndl_device ndl_device_t;

// Feature bitmap's bit index information
typedef enum feature_bitmap_bit_index {
	BIT_INDEX_TEST_FEATURE = 0,
	BIT_INDEX_MULTICORE_FEATURE = 1,
	BIT_INDEX_COUNT = BIT_INDEX_MULTICORE_FEATURE + 1
} feature_bitmap_bit_index_t;

/** Opens NDS for the given pid. If pid == 0, it is acquired for the current PID
 * and opened in read-write mode. If pid != 0, it is acquired for the provided PID
 * and opened read-only.
 *
 * @param device[in] - ndl_device used to open this NDS
 * @param pid[in] - pid for which to open the NDS; if 0, it's opened as r/w for the current process
 * @param inst[out] - address of a pointer which will contain the instance handle
 *
 * @return non zero in case of error
 */
int nds_open(ndl_device_t *device, pid_t pid, nds_instance_t **inst);

/** Releases the NDS instance and frees the data associated with it (mandatory for readers)
 *
 * @param inst[in] - NDS instance to close
 *
 * @return non zero in case of error; the pointer gets deleted regardless
 */
int nds_close(nds_instance_t *inst);

/* --------------------------------------------
 * NDS Neuroncore Counters
 * --------------------------------------------
 */
/** Increments a simple per-nc counter
 *
 * @param inst[in] - NDS instance
 * @param pnc_index[in] - Neuroncore index
 * @param counter_index[in] - Counter index
 * @param increment[in] - Amount to increment
 *
 * @return 0 on success.
 */
int nds_increment_nc_counter(nds_instance_t *inst, int pnc_index, uint32_t counter_index, uint64_t increment);
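/*
 * Illustrative flow (not part of this header): a writer opens its own NDS
 * (pid 0 => current process, read-write) and bumps a per-NC counter; readers
 * would pass a target pid instead and get read-only access:
 *
 *   nds_instance_t *inst;
 *   if (nds_open(dev, 0, &inst) == 0) {
 *       nds_increment_nc_counter(inst, 0, NDS_NC_COUNTER_INFER_COMPLETED, 1);
 *       nds_close(inst);
 *   }
 */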
/** Decrements a simple per-nc counter
 *
 * @param inst[in] - NDS instance
 * @param pnc_index[in] - Neuroncore index
 * @param counter_index[in] - Counter index
 * @param decrement[in] - Amount to decrement
 *
 * @return 0 on success.
 */
int nds_decrement_nc_counter(nds_instance_t *inst, int pnc_index, uint32_t counter_index, uint64_t decrement);

/** Gets a simple per-nc counter
 *
 * @param inst[in] - NDS instance
 * @param pnc_index[in] - Neuroncore index
 * @param counter_index[in] - Counter index
 * @param value[out] - Counter value
 *
 * @return 0 on success.
 */
int nds_get_nc_counter(nds_instance_t *inst, int pnc_index, uint32_t counter_index, uint64_t *value);

/** Sets a simple per-nc counter
 *
 * @param inst[in] - NDS instance
 * @param pnc_index[in] - Neuroncore index
 * @param counter_index[in] - Counter index
 * @param value[in] - Value to set the counter to
 *
 * @return 0 on success.
 */
int nds_set_nc_counter(nds_instance_t *inst, int pnc_index, uint32_t counter_index, uint64_t *value);

/* --------------------------------------------
 * NDS Neuron Device Counters
 * --------------------------------------------
 */
/** Increments a simple per-nd counter - may overflow
 *
 * @param inst[in] - NDS instance
 * @param counter_index[in] - Counter index
 * @param increment[in] - Amount to increment
 *
 * @return 0 on success.
 */
int nds_increment_nd_counter(nds_instance_t *inst, uint32_t counter_index, uint64_t increment);

/** Decrements a simple per-nd counter - may overflow
 *
 * @param inst[in] - NDS instance
 * @param counter_index[in] - Counter index
 * @param decrement[in] - Amount to decrement
 *
 * @return 0 on success.
 */
int nds_decrement_nd_counter(nds_instance_t *inst, uint32_t counter_index, uint64_t decrement);

/** Bitwise inclusive OR operation on a counter
 *
 * @param inst[in] - NDS instance
 * @param counter_index[in] - Counter index
 * @param bit_index[in] - bit mask to OR into the counter (e.g. 1ull << bit index on the feature bitmap)
 *
 * @return 0 on success.
 */
int nds_or_nd_counter(nds_instance_t *inst, uint32_t counter_index, uint64_t bit_index);

/** Gets a simple per-nd counter
 *
 * @param inst[in] - NDS instance
 * @param counter_index[in] - Counter index
 * @param value[out] - Counter value
 *
 * @return 0 on success.
 */
int nds_get_nd_counter(nds_instance_t *inst, uint32_t counter_index, uint64_t *value);

/** Sets a simple per-nd counter
 *
 * @param inst[in] - NDS instance
 * @param counter_index[in] - Counter index
 * @param value[in] - Value to set the counter to
 *
 * @return 0 on success.
 */
int nds_set_nd_counter(nds_instance_t *inst, uint32_t counter_index, uint64_t *value);

/* --------------------------------------------
 * NDS objects
 * --------------------------------------------
 */
/** Writes an NDS object to the NDS memory
 *
 * @param obj[in] - NDS object handle
 *
 * @return 0 on success.
 */
int nds_obj_commit(nds_obj_handle_t obj);

/** Creates a new NDS object with the given type
 *
 * @param inst[in] - NDS instance
 * @param type[in] - type of object to create
 *
 * @return handle for the newly created object
 */
nds_obj_handle_t nds_obj_new(nds_instance_t *inst, int type);

/** Deletes an NDS object from NDS (and local memory)
 *
 * @param obj[in] - NDS object handle
 *
 * @return 0 on success.
 */
int nds_obj_delete(nds_obj_handle_t obj);

/** Casts this NDS object to a nds_model_node_info_t which can be used for r/w
 *
 * @param obj[in] - NDS object handle
 *
 * @return non-NULL on success.
 */
nds_model_node_info_t *nds_obj_handle_to_model_node_info(nds_obj_handle_t obj);
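/*
 * Illustrative object flow (not part of this header): create an object,
 * fill it in through the typed cast, then commit it to the datastore:
 *
 *   nds_obj_handle_t obj = nds_obj_new(inst, OBJECT_TYPE_MODEL_NODE_INFO);
 *   nds_model_node_info_t *mi = nds_obj_handle_to_model_node_info(obj);
 *   if (mi) {
 *       mi->model_id = 1;  // placeholder value
 *       nds_obj_commit(obj);
 *   }
 */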
/** Casts this NDS object to a nds_model_node_mem_usage_info_t which can be used for r/w
 *
 * @param obj[in] - NDS object handle
 *
 * @return non-NULL on success.
 */
nds_model_node_mem_usage_info_t *nds_obj_handle_to_model_node_mem_usage(nds_obj_handle_t obj);

/** Reads all model info data and returns it as an array (which must be deleted by the caller)
 *
 * @param inst[in] - NDS instance
 * @param models[out] - Pointer where to write the address of an array of length count containing object handles
 * @param count[out] - Number of models loaded (present in the models array)
 *
 * @return 0 on success.
 */
int nds_read_all_model_nodes(nds_instance_t *inst, nds_obj_handle_t **models, size_t *count);

/** Casts this NDS object to a nds_process_info_t which can be used for r/w
 *
 * @param obj[in] - NDS object handle
 *
 * @return non-NULL on success.
 */
nds_process_info_t *nds_obj_handle_to_process_info(nds_obj_handle_t obj);

/** Casts this NDS object to a nds_process_info_ext_t which can be used for r/w
 *
 * @param obj[in] - NDS object handle
 *
 * @return non-NULL on success.
 */
nds_process_info_ext_t *nds_obj_handle_to_process_info_ext(nds_obj_handle_t obj);

/** Reads process info and returns a nds_obj_handle
 *
 * @param inst[in] - NDS instance
 *
 * @return non-NULL on success.
 */
nds_obj_handle_t nds_read_process_info(nds_instance_t *inst);

/** Reads extended process info and returns a nds_obj_handle
 *
 * @param inst[in] - NDS instance
 *
 * @return non-NULL on success.
 */
nds_obj_handle_t nds_read_process_info_ext(nds_instance_t *inst);

#ifdef __cplusplus
}
#endif

================================================
FILE: src/libnrt/include/nrt/nec.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */
#pragma once

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <pthread.h>
#include <time.h>
#include "nrt/nrt_status.h"

#ifdef __cplusplus
extern "C" {
#endif

#define NEC_MAX_CHANNELS 32 /* matches MAXCHANNELS in NCCL */
#define NEC_MAX_NR_CHANNEL_CHUNKS 32 /* Channel buffers for reduce operation */
#define NEC_MAX_FOLD_N 16

/*
 * We can set max communicators to anything here, but ultimately we will be
 * limited by how much HW resource (such as TOP_SP semaphores or NX DRAM
 * space etc.) gets used up as the number of communicators goes up.
 */
#define NEC_MAX_COMM_N 12 /* Max supported replica-groups in NEFF */
#define NEC_MAX_NET_BUFFERS (2 * NEC_MAX_COMM_N) /* 2(hier & ring) x (# replica groups) */
#define NEC_CACHE_LINE_SIZE 128

/* Rank ID to denote network connector */
#define NEC_NET_CONNECTOR_RANK -1
/* MLA dev ID to denote network connector */
#define NEC_NET_MLA_DEV -1
/* MLA dev ID to denote POD connector */
#define NEC_POD_MLA_DEV -2
/* Rank ID to denote an unknown connector -> possibly not reachable */
#define NEC_UNKNOWN_RANK -3
/* MLA dev ID to denote an unknown connector -> possibly not reachable */
#define NEC_UNKNOWN_MLA_DEV -3

/* the number of hierarchical cc pipeline stages */
#define NEC_HIER_CC_PIPELINE_STAGE_N (3)

/* the max number of outgoing requests in the recv/send proxy */
#define NCCL_NET_NEURON_MAX_REQUESTS 128

/**
 * The maximum number of concurrent cc executions. As NCCL needs this
 * information, define the size in the common header file.
 */
#define NEC_MAX_STREAM_N 4

/**
 * The different types of ofi communicators that are in the netResources
 * object that is used in the recv/send proxy
 */
typedef enum ofi_comm_type {
	NET_SEND_COMM,
	NET_RECV_COMM,
	NET_RECV_LISTEN_COMM,
	LOCAL_RECV_COMM,
	LOCAL_SEND_COMM
} ofi_comm_type_t;

enum enc_comm_type { H_COMM_INTRA_ID = 0, H_COMM_INTER_ID = 1, H_COMM_MAX_ID };

/**
 * Neuron Elastic Collectives (NEC)
 *
 * This is the main component for Neuron Elastic Collectives in Neuron Runtime
 * (NRT). It provides collective operations to applications, offloaded by the
 * device, including collective comm init, receiving (post) operations,
 * building resources for the operation, triggering the operation and polling
 * for its completion.
 *
 * +-----------------------+
 * |    Collectives App    |
 * +-----------------------+
 * |  Collectives Library  |
 * +-----------------------+
 * |       NEC / NRT       |
 * +-----------------------+
 * |        DEVICE         |
 * +-----------------------+
 *
 * TODO: ENC will be renamed to NEC
 */

/* Translated from what KaenaDriver returns */
typedef enum nec_pod_type {
	NEC_POD_TYPE_NONE,
	NEC_POD_TYPE_P2P,
	NEC_POD_TYPE_SWITCH,
	NEC_POD_TYPE_INVALID
} nec_pod_type_t;

typedef struct enc_comm* nec_comm_t;
typedef struct enc_channel* nec_channel_t;
typedef uint64_t dma_addr_t;

struct enc_net_host_memory_index {
	union {
		volatile uint32_t index;
		char pad[NEC_CACHE_LINE_SIZE]; /* Avoid false-sharing */
	};
};

/**
 * Host memory structure for network transport
 *
 * The proxy-thread progress function first waits for the device to be ready by
 * polling the host index on fold 0 until it is (-1). Once (-1) is polled, the
 * proxy-thread resets the host index to 0 and notifies the device that the
 * proxy-thread is ready by incrementing the handshake semaphore by 1.
 *
 * On the sender side, the device increases the host index to post a buffer to
 * send to a remote device. The proxy-thread send progress function polls the
 * host index and sends posted buffers to the respective remote device. The
 * proxy-thread polls for send request completions and notifies the device of
 * these completions by increasing the send_complete semaphore by the number of
 * completed send requests. The device may, in response to this notification,
 * increase the host index further to post additional buffers to send.
 * The proxy-thread recognizes the last entry in the FIFO by the fact that it
 * is specially marked (See mark_fifo_end()).
 *
 * On the receiver side, the device increases the host index to post receive
 * buffers to be filled with data from a remote device. The proxy-thread recv
 * progress function polls the host index and posts the receive buffers to the
 * network plugin. The proxy-thread polls for receive completions and notifies
 * the device of these completions by increasing the recv_complete semaphore by
 * the number of completed recv requests. The device uses this notification to
 * know that data is available for processing in device memory. The device
 * may also, in response to this notification, increase the host index further
 * to post additional receive buffers. The proxy-thread recognizes the last
 * entry in the FIFO by the fact that it is specially marked.
 *
 * For the ring algorithm:
 * The sender's handshake and send_complete semaphores
 * are the send-credit semaphore.
 * The receiver's handshake and recv_complete semaphores are the recv-cnt
 * semaphore.
 *
 * For the mesh algorithm:
 * The handshake semaphore is the local-handshake event semaphore for both
 * sender and receiver.
 * The receiver's recv_complete semaphore is the broadcast event semaphore.
 * The sender's send_complete semaphore is the sync event semaphore.
 */
struct enc_net_host_memory {
	union {
		struct {
			struct enc_net_host_memory_index post_recv[NEC_MAX_FOLD_N];
		} recv;
		struct {
			struct enc_net_host_memory_index post_send[NEC_MAX_FOLD_N];
		} send;
	};
};

typedef struct enc_host_mem {
	void *mem_handle;
	void *va;
	dma_addr_t pa;
	size_t size;
} enc_host_mem_t;

typedef struct enc_host_mem_shared {
	enc_host_mem_t mem;
	int refcnt;
} enc_host_mem_shared_t;

/**
 * Network connector structure containing allocated resources for network transport
 */
struct enc_net_connector {
	int fold_n;
	enc_host_mem_t net_host_mem; /* Used to signal proxy thread */
	enc_host_mem_shared_t *dynamic_input_host_mem; /* Used to pass info only available during execution */
	/* Network transport buffer, allocated only for sender */
	void *devmem_res;
	void *nccl_mhandle;
	/* Address and mhandle for event semaphores and pre-registered buffers */
	void *inc_recv_sem_nccl_mhandle;
	uint32_t *inc_recv_sem_values_buffer;
	void *inc_recv_sem_values_buffer_mhandle;
	/*
	 * NCCL network connector data structure. When one proxy worker is used for
	 * the same type (recv or send) of network operation, connector information
	 * should be included in each transaction.
	 */
	void *nccl_connector;
};

typedef enum enc_pattern {
	ENC_PATTERN_RING,
	ENC_PATTERN_MESH,
	ENC_PATTERN_INVALID,
} enc_pattern_t;

typedef enum enc_net_connectivity {
	ENC_CONNECTIVITY_MESH,
	ENC_CONNECTIVITY_RDH,
	ENC_CONNECTIVITY_DEFAULT
} enc_net_connectivity_t;

struct enc_channel {
	/*
	 * Application parameters for init
	 */
	int id;
	enc_pattern_t pattern;
	/* Applicable only in case of a remote neighbor */
	struct enc_net_connector *net_recv; /* if receiving from a rank over the network */
	struct enc_net_connector *net_send; /* if sending to a rank over the network */
	/*
	 * Neuron Runtime context
	 */
	void *devmem_res;
	void *two_step_pod_mesh_devmem_res;
	/* Gateway buffer is allocated only when hybrid ring is supported */
	void *devmem_gw_buf_res;
	void *nccl_mhandle;
	dma_addr_t gw_recv_buffer;
	dma_addr_t gw_send_buffer;
	struct enc_channel_context *ch_ctx;
	struct encd_dma_channel *drv_channel;
};

struct enc_peer_info {
	int neuron_dev;
	int rid;
	int tpb_index;
	int pod_node_id;
};

typedef enum enc_topology_mode {
	ENC_TOPO_NULL = 0,
	ENC_TOPO_4_DEVS_IN_ROW,
	ENC_TOPO_4_DEVS_IN_COLUMN,
} enc_topology_mode_t;

struct enc_comm_info {
	int neuron_dev;
	int rank;
	int rank_n;
	int local_rank_n;
	int local_rack_rank_n;
	int node;
	int node_n;
	enc_topology_mode_t enc_topo_mode;
	/* Pod information received from NCCL */
	bool enable_pod;
	bool use_net; /* Whether a network interface is used or not with the communicator */
	int pod;
	int pod_n;
	int pod_node;
	int pod_node_n;
	struct enc_peer_info *peers;
};

struct enc_ring {
	int prev;
	int next;
	int *user_ranks; /* used by one_rank_per_device rings only */
	bool duplicate;
};

/* Kangaring */
#define NEC_KANGARING_MAX_NUM_RANKS (256)
#define KANGARING_NUM_SENG_PER_DEV (4)
#define KANGARING_NUM_TPB_PER_DEV (8)
#define KANGARING_MAX_SECONDARIES (3)

enum SEngine { S0 = 0, S1 = 1, S2 = 2, S3 = 3, SENGS_PER_DIE = 2, SENGS_PER_MLA = 4 };

struct enc_kangaring {
	int vnc; // virtual neuron core size
	int logical_path[NEC_KANGARING_MAX_NUM_RANKS]; // the logical kangaring path: p0 s0 p1 s1 ...
	int prev; // upstream
	int next; // downstream
	int port; // port to go to next
	/* In the VNC 2 case, this is the only peer. For primary ranks, it refers to their secondary rank;
	 * for secondary ranks, it refers to their primary rank.
	 * In the VNC 1 case, it refers specifically to the peer over rmtv with the same tpb index. */
	int peer_rmtv;
	/* In the VNC 1 case, we have these 2 additional peers.
	 * peer_rmtv2 refers to the peer over rmtv with a different tpb index.
	 * peer_local refers to the local peer with a different tpb index */
	int peer_rmtv2;
	int peer_local;
	int next_peer_rmtv; // next's peer over rmtv
	bool is_primary; // is self rank on the data path?
	bool is_next_pcie; // is the next primary reached via pcie or d2d?
	bool duplicate; // is this a duplicate channel?
	bool pattern2; // is pattern 2?
};

typedef enum metaring_type { RING, KANGARING, SINGLE_CYCLE_RING, RDH, INVALID_METARING } metaring_type_t;

struct enc_alg_metaring {
	int channel_n;
	struct enc_channel channels[NEC_MAX_CHANNELS];
	struct enc_ring ring_ranks[NEC_MAX_CHANNELS];
	struct enc_kangaring kangaring_ranks[NEC_MAX_CHANNELS];
	metaring_type_t type;
	/* Does the group contain only one rank per device? This variable is set to true when NCCL
	 * returns device-level H-cycles to runtime. In this case, we will parse that device H-cycle
	 * and generate ring paths on the runtime side. We do this because we need to enforce certain
	 * pre-defined patterns in the paths so that we avoid deadlocks between concurrent groups.
	 */
	bool one_rank_per_device;
	/* Hybrid ring is supported when the RG has 4 H-cycles of one_rank_per_device */
	bool is_hybrid_ring;
	bool tokens_exchanged; /* reinitialized tokens from old metaring config */
	bool deadlock_free_rank_list;
	struct enc_comm *comm; /* Backward reference to ENC comm */
	struct encd_alg_metaring *drv_alg;
	/* For use by src/tgt pairs only */
	bool skip_send;
	bool skip_recv;
};

/*
 * The order of the events matters here, so when adding a new event make sure the event is added
 * to the right section of the list:
 *
 * ENC_COMMON_NUM_EVENT_TYPE: contains all common events between RDH-Mesh or A2A-mesh
 * ENC_MESH_NUM_EVENT_TYPE-ENC_COMMON_NUM_EVENT_TYPE: contains events used by mesh
 * ENC_A2A_NUM_EVENT_TYPE-ENC_MESH_NUM_EVENT_TYPE: contains events used by A2A only
 * ENC_RDH_NUM_EVENT_TYPE-ENC_A2A_NUM_EVENT_TYPE: contains events used by RDH only
 */
typedef enum enc_mesh_event_type {
	EVT_SYNC,
	EVT_GLOBAL_HNDSHK,
	EVT_LOCAL_HNDSHK,
	EVT_INTER_GRP_BRDCST,
	EVT_FUNCTION_BARRIER_FIRST_COLL,
	EVT_FUNCTION_BARRIER_LAST_COLL,
	EVT_REDUCE_LOCAL_HNDSHK,
	EVT_INTRA_GRP_BRDCST,
	ENC_COMMON_NUM_EVENT_TYPE,
	ENC_MESH_NUM_EVENT_START = ENC_COMMON_NUM_EVENT_TYPE,
	EVT_REDUCE_COPY = ENC_COMMON_NUM_EVENT_TYPE,
	EVT_REDUCE_COPY_2,
	EVT_REDUCE_WRITE,
	EVT_INTER_GRP_BRDCST_2,
	EVT_LOCAL_AND_POD_GRP_BRDCST,
	EVT_LOCAL_AND_POD_GRP_BRDCST_2,
	ENC_MESH_NUM_EVENT_TYPE,
	ENC_A2A_NUM_EVENT_START = ENC_MESH_NUM_EVENT_TYPE,
	EVT_LOCAL_HNDSHK_1 = ENC_MESH_NUM_EVENT_TYPE,
	EVT_LOCAL_HNDSHK_2,
	EVT_GLOBAL_HNDSHK_1,
	EVT_INTER_GRP_BRDCST_1,
	EVT_INTRA_GRP_BRDCST_1,
	EVT_2DEV_BRDCST,
	EVT_2DEV_HNDSHK,
	EVT_COPY_FROM_HOST,
	ENC_A2A_NUM_EVENT_TYPE,
	ENC_RDH_NUM_EVENT_START = ENC_A2A_NUM_EVENT_TYPE,
	EVT_RH_STEP_0 = ENC_A2A_NUM_EVENT_TYPE,
	EVT_RH_STEP_1,
	EVT_RH_STEP_2,
	EVT_RH_STEP_3,
	EVT_RH_STEP_4,
	EVT_RH_STEP_5,
	EVT_RH_STEP_6,
	EVT_RH_STEP_7,
	EVT_RH_STEP_8,
	EVT_RH_STEP_9,
	EVT_RDH_LOCAL_HANDSHAKE = EVT_RH_STEP_9,
	EVT_RDH_AXES_HANDSHAKE,
	EVT_RD_STEP_0,
	EVT_RD_STEP_1,
	EVT_RD_STEP_2,
	EVT_RD_STEP_3,
	EVT_RD_STEP_4,
	EVT_RD_STEP_5,
	EVT_RD_STEP_6,
	EVT_RDH_AXES_HANDSHAKE_2,
	EVT_1DEV_RDH_STEP_1,
	EVT_1DEV_RDH_STEP_2,
	EVT_1DEV_RD_STEP_1,
	EVT_1DEV_RD_STEP_2,
	EVT_1DEV_RH_STEP_1,
	EVT_2DEV_RD_STEP_0,
	EVT_2DEV_RD_STEP_1,
	EVT_2DEV_RD_STEP_2,
	EVT_2DEV_RD_STEP_3,
	EVT_2DEV_RD_STEP_4,
	EVT_RDH_LOCAL_PEER_HANDSHAKE,
	ENC_RDH_NUM_EVENT_TYPE
	// We assume each event is used only once
	// Enforced by encd_init_mesh_event()
} enc_mesh_event_type_t;

#define ENC_MESH_MAX_NUM_EVENTS 64

#define KiB (1024)
#define MiB (1024 * KiB)
#define GiB (1024 * MiB)

struct enc_mesh_nbr_grp {
	int *ranks;
	int ranks_n;
};

struct enc_mesh_event {
	struct enc_mesh_nbr_grp src_neighbor_grp;
	struct enc_mesh_nbr_grp dst_neighbor_grp;
	bool valid;
	enc_mesh_event_type_t evt_type;
};

typedef enum enc_alg_mesh_type {
	ENC_ALG_FULL_MESH,
	ENC_ALG_GROUPED_MESH,
	ENC_ALG_MESH_TRN2,
	ENC_ALG_MESH_SWITCH,
	ENC_ALG_MESH_INVALID
} enc_alg_mesh_type_t;

/* TODO: In a separate commit we will change this to a cpp
 * file so we can have classes */
#define ENC_MAX_OP_TYPES (13)

struct enc_alg_mesh_subtype {
	struct enc_mesh_event events[ENC_MESH_MAX_NUM_EVENTS];
	int num_events;
	struct encd_alg_mesh_subtype *drv_mesh;
	struct enc_alg_mesh *mesh; /* backward reference */
	size_t op_max_limit[ENC_MAX_OP_TYPES]; /* upper limit below which we will use mesh */
	size_t op_min_limit[ENC_MAX_OP_TYPES]; /* lower limit above which we will use mesh */
	size_t op_max_limit_sbuf[ENC_MAX_OP_TYPES]; /* upper limit below which we will use mesh for 2D tensors */
	size_t op_min_limit_sbuf[ENC_MAX_OP_TYPES]; /* lower limit above which we will use mesh for 2D tensors */
	bool no_inplace_support;
	bool is_use_chnl_buffer; /* Whether the channel buffer will be used or not */
	bool is_rdh;
	bool is_single_step_mesh;
	bool is_two_step_pod_mesh;
	bool is_latency_opt;
	bool is_bw_opt;
	bool is_rmv_dst_routing;
	uint32_t alltoall_iteration;
};

#define ENC_MAX_MESH_SUBTYPES (20)
#define ENC_MESH_MAX_NUM_DEVICES (128)

struct enc_alg_mesh {
	enc_alg_mesh_type_t mesh_type;
	union {
		struct {
			uint32_t devid_to_rankid[ENC_MESH_MAX_NUM_DEVICES];
			/* Whether it is a single or a multi chip mesh */
			bool is_multi_chip;
		} trn2;
		struct {
			int num_non_net_node_local_groups;
		} trn1;
		struct {
			bool root_rank;
			int num_intra_group_roots;
			int local_root_ids[ENC_MESH_MAX_NUM_DEVICES];
			int global_root_ids[ENC_MESH_MAX_NUM_DEVICES];
		} inf2;
	};
	int group_id;
	int num_groups;
	/* Mesh uses only a single channel */
	struct enc_channel channel;
	struct enc_alg_mesh_subtype mesh_subtype[ENC_MAX_MESH_SUBTYPES];
	/* Holds the maximum amount of data a single group is allowed to deposit into
	 * the channel buffer. The definition of a group varies by platform type.
	 * On TRN1 and TRN2 a group currently consists of all or some ranks from a
	 * single chip, but on INF2 it refers to a collection of chips. The concept
	 * of a group exists to avoid traffic replication on the wire by combining
	 * input data from multiple ranks within a group before sending it outside
	 * of the group. Therefore at the destination side we only receive a single
	 * chunk of data per group. */
	size_t max_chbuf_space_per_group;
	/* Valid only for TRN2. On TRN2, to prevent AXI deadlock we avoid on-chip
	 * routing at the destination chip and deposit data in the HBM closest to
	 * the entry port. So the rank owning that HBM receives data on behalf of
	 * other ranks on that same chip. This is why we need to carve out dedicated
	 * channel buf space for each of the other s-engines on the same chip. */
	size_t max_chbuf_space_per_seng;
	/* Valid only for single step mesh where we directly copy the entire input
	 * buffer into another rank's channel buffer. */
	size_t max_chbuf_space_per_rank;
	/* Whether to use double buffering to skip the global handshake */
	bool double_buffer;
	/* Whether to build RDH */
	bool build_rdh;
	bool rdh_double_buffer;
	void *rdh_devmem_res; /* intra rdh channel buffer */
	bool use_2dev_proxy;
	bool tokens_exchanged; /* reinitialized tokens from old mesh config */
	bool use_net; /* Whether inter-node mesh with network proxy is used or not */
	/* Backward references to NCCL comm and general cluster info.
	 * These might come from enc_comm or enc_alg_hier */
	struct enc_nccl_comm_node *nccl_comm_node; /* Reference to NCCL comm */
	struct enc_comm_info *ci; /* General cluster information */
	struct enc_comm *comm; /* Backward reference to ENC comm */
	struct encd_alg_mesh *drv_alg;
	/*
	 * DMA mapped memory to host dedicated for A2Av metadata, available only
	 * during execution.
	 */
*/ enc_host_mem_t alltoallv_host_input; }; struct enc_alg_hier { struct { struct enc_nccl_comm_node *nccl_comm_node; struct enc_comm_info ci; struct enc_alg_metaring ring; struct enc_alg_metaring kangaring; struct enc_alg_mesh mesh; } intra; struct { struct enc_nccl_comm_node *nccl_comm_node; struct enc_comm_info ci; struct enc_alg_metaring ring; struct enc_alg_metaring rdh; struct enc_alg_mesh mesh; } inter; struct { struct { struct enc_nccl_comm_node *nccl_comm_node; struct enc_comm_info ci; struct enc_alg_metaring ring; } stage[NEC_HIER_CC_PIPELINE_STAGE_N]; } pipeline; void* devmem_res; /* Hierarchical Reduce Scatter uses intermediate buffer */ struct enc_comm *comm; /* Backward reference to ENC comm */ struct encd_alg_hier *drv_alg; }; /** * Comm info to query from NCCL */ typedef struct nccl_comm_info { /* General cluster information */ uint64_t cluster_id; // randomly generated id used to identify unique clusters in log metrics time_t epoch; // the epoch of the initial barrier at the start of a collectives execution. used when generating core dumps so that all ranks agree on a datetime. int neuron_dev; int rank; int rank_n; int local_rank_n; int local_rack_rank_n; int node; int node_n; bool enable_pod; bool use_net; /* Whether network interface is used or not with the communicator */ int pod; int pod_n; int pod_node; int pod_node_n; struct enc_peer_info *peers; /* Needs to be allocated before calling ncclGetCommInfo() or NULL if peers info is not needed */ /* Ring algorithm information */ int channel_n; struct enc_ring rings[NEC_MAX_CHANNELS]; /* Kangaring algorithm information */ int kangaring_channel_n; int* kangaring_paths[NEC_MAX_CHANNELS]; /* Hamiltonian cycles of MLAs, used to construct 1-rank-per-mla rings */ int mla_cycle_n; int* mla_cycles[NEC_MAX_CHANNELS]; } nccl_comm_info_t; typedef struct enc_nccl_comm_node { void *nccl_comm; char *key; size_t key_sz; /* Tracking the graph information in the nccl_comm. We can use * ncclGetCommInfo() but it's expensive. Instead, simply track the graph * information here. This flag can only changed from true to false. The * other way is not possible. */ bool disable_graph; bool global_nccl_comm_node; int refcnt; uint32_t stream_id; uint32_t context_id; uint32_t num_local_participants; uint32_t num_local_leaders; uint32_t my_local_leader; uint32_t *local_participants; uint32_t *local_leaders; struct bp_barrier *local_barrier; bool intra_pod_interface; /* When intra-pod interface is used, we can't skip exeuction barrier */ } enc_nccl_comm_node_t; /* Neuron Device information. This data structure is used to send the device information from NRT to * nccom for nccl communicator building. */ #define ENC_PROXY_HISTOGRAM_OUTPUT_PATH_LENGTH_MAX (128) typedef struct enc_proxy_histogram_config { bool enable; size_t bucket_usecs; size_t num_buckets; size_t per_neff_warmup; size_t warmup; char output_path[ENC_PROXY_HISTOGRAM_OUTPUT_PATH_LENGTH_MAX]; } enc_proxy_histogram_config_t; typedef struct enc_neuron_device_info { int nec_dev_id; int mla_idx; int tpb_idx; int host_device_id; int routing_id; uint64_t pod_id; nec_pod_type_t pod_type; uint32_t pod_node_id; uint32_t virtual_server_id; enc_proxy_histogram_config_t histogram_config; } enc_neuron_device_info_t; /** * Collective communicator corresponding to ncclComm structure * * enc_comm is the Collective Comm that holds all the necessary information to * execute an collective operation. 
 * This should be pre-set before operations are
 * posted, mainly because of the topology information built upon physical
 * connectivity. Collective operations are executed on multiple channels and a
 * channel is a path for data transfer along a pre-built topology.
 */
struct enc_comm {
    struct enc_nccl_comm_node *nccl_comm_node;  /* Reference to NCCL comm */
    struct enc_comm_info ci;                    /* General cluster information */
    int id;
    int stream_id;
    /*
     * Algorithms
     */
    struct enc_alg_metaring ring;
    struct enc_alg_metaring kangaring;
    struct enc_alg_metaring rdh;
    struct enc_alg_hier hier;
    struct enc_alg_mesh mesh;
    /**
     * Use these handles to share network connector buffers across NEFFs.
     * Only used in global comm. Other comms will refer to the global comm to reuse them.
     * We use net_conn_count to sequentially assign these reservations to network connectors
     * to make sure:
     * 1) different comms in a NEFF don't reuse the same buffer (for multi-stream cases)
     * 2) for each NEFF, we always start with index 0 and go up for the most overlap and
     *    reusability. We reset net_conn_count to 0 in enc_load_operations
     */
    int net_conn_count;
    void* net_connector_devmem_res[NEC_MAX_NET_BUFFERS];
    // TODO: nr_channel_chunks and chunk_size should not be a comm property anymore
    int nr_channel_chunks;      /* Channel buffer depth, applies to all channels */
    size_t chunk_size;          /* Unit of transfer, applies to all channels */
    struct encd_comm *drv_comm; /* Reference to driver comm */
    char topology[1024];        /* Used for debugging purposes only to print the topology in case of an error */
};

/**
 * Global communicator
 */
struct enc_glb_comm {
    uint32_t g_device_id;   /* Same as comm->rank */
    uint32_t g_device_cnt;  /* Same as comm->rank_n */
    uint32_t vtpb_idx;
    int nec_dev_id;
    int mla_idx;
    /* Absolute neuron device hw id. This is the ID that the driver exposes the
       neuron device on to the host system aka OS. Neuron devices are exposed to
       RT by a different ID in case docker remaps devices */
    int host_device_id;
    int routing_id;
    uint32_t virtual_server_id;
    nec_pod_type_t pod_type;
    uint32_t pod_node_id;
    uint32_t pod_sz;
    uint64_t pod_id;
    const char *root_comm_id;   /* By getenv in nrt_config */
    bool check_sigs;            /* By getenv in nrt_config */
    uint32_t *rank_nodes;       /* The node index of each rank */
    uint32_t *local_ranks;      /* The intra-node rank of each rank */
    enc_nccl_comm_node_t nccl_comm_node;    /* nccl_comm node can be used by any stream */
    struct bananaphone *local_rings;
    struct bp_handle *local_peer_handles;
    /**
     * A set of buffers containing values that are used to
     * increment semaphores over efa transactions.
     */
    uint32_t *inc_recv_sem_values_buffer;
    size_t inc_recv_sem_values_buffer_size;
    struct enc_comm comm;
    /* TODO: manage all the devmem reservations in a single place
     * Today we share the buffers under the below path:
     *   enc_glb_comm->comm->ring.channels[ring_channel_id].devmem_res
     * We need to move the above reservations and the one below to a
     * singleton class e.g.
     *   enc_glb_comm->devmem_res_pool */
    void* inter_rdh_devmem_res[NEC_MAX_STREAM_N];
    /* TODO: manage all the devmem reservations in a single place
     * this mem res is referred to by comm->rdh.rdh_devmem_res */
    void* intra_rdh_devmem_res[NEC_MAX_STREAM_N];
    void* mesh_devmem_res_per_rg[NEC_MAX_STREAM_N * NEC_MAX_COMM_N * H_COMM_MAX_ID];
    void* rdh_devmem_res_per_rg[NEC_MAX_STREAM_N * NEC_MAX_COMM_N];
    void *gateway_devmem_res[NEC_MAX_STREAM_N][NEC_MAX_CHANNELS];
    pthread_mutex_t gcomm_setup_mtx;
    void *proxy_queue;  // opaque pointer to enc_proxy_queue
    void *device_barrier_table;
};

/**
 * Network transport FIFOs
 *
 * The host send proxy should know the EFA buffer index, the offset in the buffer and the size of
 * each data transfer to send to the remote device, and the recv proxy
 * needs destination addresses for each data from the sender to submit network receive requests.
 * Send and recv proxy should know when to report the completion of using an
 * EFA buffer, and complete is used to notify it.
 *
 * Such information is recorded when an operation is loaded and becomes available on execution. Host
 * proxy uses these APIs to query the recorded FIFO.
 */

/**
 * A net_ops_info_t entry corresponds to a set of smaller operations that are defined by multiple
 * net_src_addr_t and net_dest_addr_t. These sub operations can correspond to different types of
 * actions, so store a net_addr_mark_t identifier in each net_src_addr_t or net_dest_addr_t entry
 * to denote the purpose of the sub-operation.
 */
typedef enum net_addr_mark {
    NET_TRANSFER,       /* Will drive data transfer over EFA */
    NET_OP_COMPLETE,    /* Will mark final completion of a collective operation */
    EXEC_COMPLETE       /* Will mark final completion of a collective load execution */
} net_addr_mark_t;

typedef struct net_src_addr {
    uint32_t net_op_idx;
    int complete;
    dma_addr_t dev_addr;
    void *host_addr;
    void *nccl_mhandle;
    uint32_t size;
    net_addr_mark_t mark;
    void* proxy_histogram_tag;
    /* Fields below are for mesh only */
    int dst_rank;       /* For local RDMA read */
    void *dst_addr;
    void *dst_mhandle;
} net_src_addr_t;

typedef struct net_dest_addr {
    uint32_t net_op_idx;
    int complete;
    dma_addr_t dev_addr;
    void *host_addr;
    void *nccl_mhandle;
    uint32_t size;
    net_addr_mark_t mark;
    /* Fields below are for mesh only */
    int src_rank;
} net_dest_addr_t;

typedef struct net_ops_info {
    uint16_t sema_shift_offset;
    bool early_send_completion;
    bool early_recv_posting;
    volatile uint32_t *inc_send_handshake;
    volatile uint32_t *inc_send_complete;
    volatile uint32_t *inc_recv_handshake;
    volatile uint32_t *inc_recv_complete;
    uint32_t tx_entry_cnt;
    uint32_t rx_entry_cnt;
    uint32_t net_idx_loop_size;
    uint32_t initial_send_credits;
    uint32_t ending_recv_credits;
    size_t data_type_sz;
    bool is_dynamic_send_recv_sz;
    bool variable_peer;
    bool add_to_histogram;
    /*
     * proxy uses this pointer to get connector information from transaction
     * saddr/daddr fifo entry of each operation.
     */
    void *enc_channel;
} net_ops_info_t;

/**
 * API for proxy-thread to increase handshake and send/recv semaphores by writing directly to the
 * memory mapped semaphore inc register.
 * For more information, see documentation on struct enc_net_host_memory definition.
 */
void nec_inc_semaphore(volatile uint32_t *sem_inc_addr, uint32_t val);

/**
 * API for proxy-thread to get dynamic send size and offset for the case where message
 * size is determined by data only available during execution.
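 *
 * Example (a minimal sketch; dyn_input, dst_rank and rank_n are hypothetical
 * values owned by the calling proxy loop, and error handling is omitted):
 *
 *   size_t sz  = nec_get_dynamic_send_size_bytes(dyn_input, sizeof(float), dst_rank, rank_n);
 *   size_t off = nec_get_dynamic_send_offset_bytes(dyn_input, sizeof(float), dst_rank, rank_n);
 *   // post a network send of sz bytes starting at byte offset off for dst_rank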
 */
size_t nec_get_dynamic_send_size_bytes(enc_host_mem_t *dyn_input, size_t data_type_sz, int dst_rank, int rank_n);
size_t nec_get_dynamic_send_offset_bytes(enc_host_mem_t *dyn_input, size_t data_type_sz, int dst_rank, int rank_n);
size_t nec_get_dynamic_recv_offset_bytes(enc_host_mem_t *dyn_input, size_t data_type_sz, int src_rank, int rank_n);
void nec_set_recv_size_bytes(enc_host_mem_t *dyn_input, size_t recv_size_bytes, size_t data_type_sz, int src_rank, int rank_n);

/**
 * Query device information
 */
int nec_get_device_count(int *available_devices_array, uint32_t array_size);
int nec_get_device_pci_bdf(int neuron_dev, uint32_t *domain, uint32_t *bus_num, uint8_t *pci_slot, uint8_t *dev_func);

/**
 * Query vcore size
 */
NRT_STATUS nec_get_virtual_core_size(uint32_t *virtual_core_size);

typedef struct nec_version_info {
    uint64_t major;
    uint64_t minor;
    uint64_t patch;
    uint64_t maintenance;
    char git_hash[16];
    uint64_t compatibility_version;
    // Any new fields added need to go here. The fields before this cannot be
    // changed, to maintain backward compatibility
    uint8_t future_fields[];
} nec_version_info_t;

NRT_STATUS nec_get_version_info(nec_version_info_t *version_info);
NRT_STATUS nec_build_port_and_rid_map(int local_nec_dev_id, int *mla_indexes, int *host_device_ids, int count);
bool nec_is_mla_available(int local_nec_dev_id, int mla_idx);
int nec_mla_idx_to_rid(int local_nec_dev_id, int mla_idx);
int nec_rid_to_mla_idx(int local_nec_dev_id, int rid);
int nec_get_peer_mla_idx(int local_nec_dev_id, int mla_idx, int port);
int nec_get_p2p_pod_peer_node(uint32_t nec_dev_id, int node, uint32_t port_distance, int *peer_node);
NRT_STATUS nec_pod_node_can_access_peer_node(nec_pod_type_t pod_type, uint32_t local_rid, uint32_t local_node_id,
                                             uint32_t remote_rid, uint32_t remote_node_id, int *can_access_peer);
void nec_ndl_printk(char *str, uint32_t size, uint32_t action);

#ifdef __cplusplus
}
#endif


================================================
FILE: src/libnrt/include/nrt/nrt.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */

#pragma once

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

// Use quoted includes in nrt headers including other nrt headers. Most clients
// (ptxla, jax, etc.) build with bazel, and bazel has issue with angle-brackets.
// See https://bazel.build/docs/bazel-and-cpp#include-paths for details.
#include "nrt/nrt_status.h"
#include "ndl/neuron_driver_shared_tensor_batch_op.h"

#ifdef __cplusplus
extern "C" {
#endif

/** Major and minor version of runtime. */
#define NRT_MAJOR_VERSION 2
#define NRT_MINOR_VERSION 0

typedef struct nrt_model nrt_model_t;
typedef struct nrt_tensor nrt_tensor_t;
typedef struct nrt_cc_context nrt_cc_context_t;

/**
 * WARNING: Do not change the value of existing enums!
 * These values will be used by libnrt consumers; we
 * cannot change the defines under them, only append.
 */
typedef enum {
    NRT_TENSOR_PLACEMENT_DEVICE,
    NRT_TENSOR_PLACEMENT_HOST,
    NRT_TENSOR_PLACEMENT_VIRTUAL,
} nrt_tensor_placement_t;

typedef enum {
    NRT_FRAMEWORK_TYPE_INVALID = 0,     // Invalid
    NRT_FRAMEWORK_TYPE_NO_FW = 1,       // Framework-less execution
    NRT_FRAMEWORK_TYPE_TENSORFLOW,      // Tensorflow
    NRT_FRAMEWORK_TYPE_PYTORCH,         // Pytorch
    NRT_FRAMEWORK_TYPE_MXNET,           // Mxnet
    NRT_FRAMEWORK_TYPE_PRECHECK,        // Neuron Node Precheck
} nrt_framework_type_t;

enum {
    NRT_INSTANCE_UNKNOWN = 0,
    NRT_INSTANCE_INF1 = 1,
    NRT_INSTANCE_TRN1 = 2,
    NRT_INSTANCE_TRN1N = 3,
    NRT_INSTANCE_INF2 = 4,
    NRT_INSTANCE_TRN2 = 5,
    NRT_INSTANCE_TRN2N = 6,
    NRT_INSTANCE_INF2E = 7,
    NRT_INSTANCE_TRN2P = 8,
    NRT_INSTANCE_TRN2U = 9,
    NRT_INSTANCE_TRN2E = 10,
    NRT_INSTANCE_TRN2EU = 11,
    NRT_INSTANCE_TRN2AC = 12,
    NRT_INSTANCE_TRN2UAC = 13,
    NRT_INSTANCE_TRN3 = 14,
    NRT_INSTANCE_TRN3PDS98 = 15
};

enum {
    NRT_INSTANCE_SIZE_1XL,
    NRT_INSTANCE_SIZE_2XL,
    NRT_INSTANCE_SIZE_4XL,
    NRT_INSTANCE_SIZE_6XL,
    NRT_INSTANCE_SIZE_8XL,
    NRT_INSTANCE_SIZE_24XL,
    NRT_INSTANCE_SIZE_32XL,
    NRT_INSTANCE_SIZE_48XL,
    NRT_INSTANCE_SIZE_3XL,
    // Note: Add new sizes right above this line to prevent breaking backward compatibility
    NRT_INSTANCE_SIZE_UNKNOWN,
    NRT_INSTANCE_SIZE_NUM = NRT_INSTANCE_SIZE_UNKNOWN,
};

typedef enum nrt_op_type {
    NRT_OP_ADD = 0x0,
    NRT_OP_FMA = 0x1,
    NRT_OP_MAX = 0x2,
    NRT_OP_MIN = 0x3,
    NRT_OP_INVALID = 0xF,
} nrt_op_type_t;

typedef enum nrt_dtype {
    NRT_DTYPE_UNKNOWN = 0x0,
    NRT_DTYPE_INVALID = 0x0,
    NRT_DTYPE_FP8_E3 = 0xD,
    NRT_DTYPE_FP8_E4 = 0xE,
    NRT_DTYPE_FP8_E5 = 0xF,
    NRT_DTYPE_FLOAT16 = 0x7,
    NRT_DTYPE_BFLOAT16 = 0x6,
    NRT_DTYPE_FLOAT32 = 0xA,
    NRT_DTYPE_FP32R = 0xB,
    NRT_DTYPE_UINT8 = 0x3,
    NRT_DTYPE_UINT16 = 0x5,
    NRT_DTYPE_UINT32 = 0x9,
    NRT_DTYPE_UINT64 = 0x1,
    NRT_DTYPE_INT8 = 0x2,
    NRT_DTYPE_INT16 = 0x4,
    NRT_DTYPE_INT32 = 0x8,
    NRT_DTYPE_INT64 = 0xC,
} nrt_dtype_t;

typedef enum nrt_cc_op_type {
    NRT_CC_ALLGATHER,
    NRT_CC_ALLREDUCE,
    NRT_CC_REDUCESCATTER
} nrt_cc_op_type_t;

typedef struct nrt_instance_info {
    uint32_t family;
    uint32_t size;
    char arch_name[16];
    char device_revision[8];
} nrt_instance_info_t;

NRT_STATUS nrt_get_instance_info(nrt_instance_info_t *info, size_t instance_info_len);

/** Initialize neuron runtime.
 *
 * @param framework[in] - Type of the framework.
 * @param fw_version[in] - Framework version as string (e.g. 2.1).
 * @param fal_version[in] - Framework Abstraction Layer version as string.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_init(nrt_framework_type_t framework, const char *fw_version, const char *fal_version);

/** Closes all the devices and cleans up the runtime state. */
void nrt_close();

/** Load given NEFF and place it in one or more neuron cores.
 *
 * @param neff_bytes[in] - Pointer to NEFF data.
 * @param size[in] - Length of the NEFF data.
 * @param vnc[in] - VNC index where the NEFF should be loaded (-1 means the runtime will automatically load in the first free VNC).
 * @param vnc_count[in] - DEPRECATED: always use -1
 * @param model[out] - Resulting model would be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_load(const void *neff_bytes, size_t size, int32_t vnc, int32_t vnc_count, nrt_model_t **model);

/** Load given NEFF for collective operations and place it in one or more neuron cores.
 *
 * If the global NCCL communicator was not previously created, we will create it inside this API with the
 * assumption that the global device id is the same as ctx_device_id and the global device count is the
 * same as ctx_device_count.
 *
 * @param neff_bytes[in] - Pointer to NEFF data.
 * @param size[in] - Length of the NEFF data.
 * @param vnc[in] - VNC index where the NEFF should be loaded (-1 means the runtime will automatically load in the first free VNC).
 * @param vnc_count[in] - DEPRECATED: always use -1
 * @param ctx_device_id[in] - Device ID relative to the number of devices participating in this NEFF
 * @param ctx_device_count[in] - Number of devices participating in collectives operations in this NEFF
 * @param model[out] - Resulting model would be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_load_collectives(const void *neff_bytes, size_t size, int32_t vnc, int32_t vnc_count,
                                uint32_t ctx_device_id, uint32_t ctx_device_count, nrt_model_t **model);

/** Unload given model and free up device and host resources.
 *
 * @param model - Model to unload.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_unload(nrt_model_t *model);

/** Get the number of VNCs used by a loaded model. (deprecated)
 *
 * @param model[in] - Model.
 * @param vnc_count[out] - The number of VNCs used by the model.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_model_nc_count(const nrt_model_t *model, uint32_t *vnc_count);

/** Get the number of VNCs used by a loaded model.
 *
 * @param model[in] - Model.
 * @param vnc_count[out] - The number of VNCs used by the model.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_model_vnc_count(const nrt_model_t *model, uint32_t *vnc_count);

/** Returns VirtualNeuronCores available in instance. (deprecated)
 *
 * @param vnc_count[out] - VirtualNeuronCores available in instance.
 *
 * @note This API can be called before nrt_init().
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_total_nc_count(uint32_t *vnc_count);

/** Returns VirtualNeuronCores available in instance.
 *
 * @param vnc_count[out] - VirtualNeuronCores available in instance.
 *
 * @note This API can be called before nrt_init().
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_total_vnc_count(uint32_t *vnc_count);

/** Returns VirtualNeuronCores visible to the application. (deprecated)
 *
 * @param vnc_count[out] - VirtualNeuronCores visible to the application.
 *
 * @note This API can be called before nrt_init().
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_visible_nc_count(uint32_t *vnc_count);

/** Returns VirtualNeuronCores visible to the application.
 *
 * @param vnc_count[out] - VirtualNeuronCores visible to the application.
 *
 * @note This API can be called before nrt_init().
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_visible_vnc_count(uint32_t *vnc_count);

/** A container to hold multiple tensors */
typedef void nrt_tensor_set_t;

/** Allocates a new tensor set.
 *
 * @param result[out] - Pointer to newly allocated tensor set would be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_allocate_tensor_set(nrt_tensor_set_t **result);

/** Destroys given tensor_set and frees memory.
 *
 * @param tensor_set[in] - Tensor set to be freed.
 *
 * @return None
 */
void nrt_destroy_tensor_set(nrt_tensor_set_t **tensor_set);

/** Add/replace given tensor to tensor set
 *
 * @param tensor_set[in] - Tensor set to which the tensor is added.
 * @param tensor_name[in] - Name of the tensor.
 * @param tensor[in] - Pointer to tensor. This pointer should be valid till nrt_destroy_tensor_set() is called.
 *
 * @return NRT_STATUS_SUCCESS on success.
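 *
 * Example (a minimal sketch; assumes "tensor" was already created with
 * nrt_tensor_allocate() and error handling is omitted for brevity):
 *
 *   nrt_tensor_set_t *inputs = NULL;
 *   nrt_allocate_tensor_set(&inputs);
 *   nrt_add_tensor_to_tensor_set(inputs, "x", tensor);
 *   ...
 *   nrt_destroy_tensor_set(&inputs);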
 */
NRT_STATUS nrt_add_tensor_to_tensor_set(nrt_tensor_set_t *tensor_set, const char *tensor_name, nrt_tensor_t *tensor);

/** Get a tensor from a tensor set.
 *
 * @param tensor_set[in] - Tensor set.
 * @param tensor_name[in] - Name of the tensor.
 * @param tensor[out] - Pointer to tensor would be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_tensor_from_tensor_set(nrt_tensor_set_t *tensor_set, const char *tensor_name, nrt_tensor_t **tensor);

/** Execute given model with given inputs and collect outputs.
 *
 * @param model[in] - Model to execute.
 * @param input_set[in] - Set of input tensors.
 * @param output_set[in] - Set of output tensors.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_execute(nrt_model_t *model, const nrt_tensor_set_t *input_set, nrt_tensor_set_t *output_set);

/** Execute given model with given inputs, repeat execution specified number of times and collect outputs.
 *
 * @param model[in] - Model to execute.
 * @param input_set[in] - Set of input tensors.
 * @param output_set[in] - Set of output tensors.
 * @param repeat_count[in] - Number of times to repeat execution.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_execute_repeat(nrt_model_t *model, const nrt_tensor_set_t *input_set, nrt_tensor_set_t *output_set,
                              int repeat_count);

/** Build (initialize and setup) NCCL global communicator.
 *
 * @param vnc[in] - Local VNC (within the instance)
 * @param g_device_id[in] - Global device id
 * @param g_device_count[in] - Max world size of all neffs that will be executed
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_build_global_comm(int32_t vnc, uint32_t g_device_id, uint32_t g_device_count);

/** Allocates a tensor that can be passed and used by a model for compute.
 *
 * @param tensor_placement[in] - Where the tensor would be allocated (device, host, or virtual memory)
 * @param vnc[in] - Virtual Neuron Core id to allocate the tensor on. Pass in -1 if allocating tensors on host memory.
 * @param size[in] - Size in bytes of the tensor to allocate.
 * @param name[in] - OPTIONAL. Name of the tensor.
 * @param tensor[out] - Pointer to newly created tensor will be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_allocate(nrt_tensor_placement_t tensor_placement, int vnc, size_t size, const char *name,
                               nrt_tensor_t **tensor);

/** Deallocates a tensor created by "nrt_tensor_allocate".
 *
 * @param tensor[in] - Deallocates given tensor.
 *
 * @return None
 */
void nrt_tensor_free(nrt_tensor_t **tensor);

/** Copies data from tensor to passed in buffer.
 *
 * @param tensor[in] - Tensor used to reference the tensor to read from.
 * @param buf[out] - Buffer used to store data read from the tensor.
 * @param offset[in] - Offset into the tensor to read from.
 * @param size[in] - Number of bytes to read.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_read(const nrt_tensor_t *tensor, void *buf, size_t offset, size_t size);

/** Copies data from passed in buffer to tensor.
 *
 * @param tensor[in/out] - Tensor used to reference the tensor to write to.
 * @param buf[in] - Buffer used to store data to write to the tensor.
 * @param offset[in] - Offset into the tensor to write to.
 * @param size[in] - Number of bytes to write.
 *
 * @return NRT_STATUS_SUCCESS on success.
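 *
 * Example (a minimal sketch; assumes "tensor" was allocated with at least
 * sizeof(data) bytes and error handling is omitted):
 *
 *   float data[4] = {0.f, 1.f, 2.f, 3.f};
 *   nrt_tensor_write(tensor, data, 0, sizeof(data));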
 */
NRT_STATUS nrt_tensor_write(nrt_tensor_t *tensor, const void *buf, size_t offset, size_t size);

/** A batch of tensor operations on a single tensor */
// the definition of nrt_tensor_batch_op_t is in neuron_driver_shared_tensor_batch_op.h
typedef struct nrt_tensor_batch {
    const nrt_tensor_t *tensor;         // Tensor handle
    const nrt_tensor_batch_op_t *ops;   // Array of operations for this tensor
    uint32_t num_ops;                   // Number of operations for this tensor
} nrt_tensor_batch_t;

/** Batch read data from multiple tensors.
 *
 * @param batches[in] - An array of batches, each of which describes operations on one tensor
 * @param num_batches[in] - Number of batches (tensors) in the array
 * @param unsafe[in] - If true, skip tensor tracking/blocking (use with caution)
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_read_batch(const nrt_tensor_batch_t *batches, uint64_t num_batches, bool unsafe);

/** Batch write data to multiple tensors.
 *
 * @param batches[in] - An array of batches, each of which describes operations on one tensor
 * @param num_batches[in] - Number of batches (tensors) in the array
 * @param unsafe[in] - If true, skip tensor tracking/blocking (use with caution)
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_write_batch(const nrt_tensor_batch_t *batches, uint64_t num_batches, bool unsafe);

/** Copies data between tensors.
 *
 * When copying between two device tensors, they must both be allocated on the SAME Neuron Core.
 * An NRT_INVALID will be returned in the failing case.
 *
 * @param src[in] - Tensor to copy from.
 * @param src_offset[in] - Offset into the source tensor to copy from.
 * @param dst[out] - Tensor to copy to.
 * @param dst_offset[in] - Offset into the destination tensor to copy to.
 * @param size[in] - Number of bytes to copy.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_copy(const nrt_tensor_t *src, size_t src_offset, nrt_tensor_t *dst, size_t dst_offset, size_t size);

/** Gets the size of the passed in tensor.
 *
 * @param tensor[in] - Tensor used to reference the tensor to get size of.
 *
 * @return Size of the tensor.
 */
size_t nrt_tensor_get_size(const nrt_tensor_t *tensor);

/** Set the memory + offset pointed to by tensor to value
 *
 * @param tensor[in] - allocated tensor
 * @param offset[in] - offset within the tensor
 * @param value[in] - value to set with
 * @param size[in] - size of memory to set
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_memset(nrt_tensor_t *tensor, uint64_t offset, int value, size_t size);

/** Allocates an empty tensor, i.e. the tensor structure w/o any attached storage
 *
 * @param name[in] - OPTIONAL. Name of the tensor.
 * @param tensor[out] - Pointer to newly created tensor will be stored here.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_tensor_allocate_empty(const char *name, nrt_tensor_t **tensor);

/** Attaches caller supplied buffer to a tensor. Any storage previously attached to the tensor is detached
 * and freed if it was owned by the tensor.
 * The buffer is supplied by the caller and must persist through the entire lifetime of the tensor.
 *
 * @param tensor[in] - Tensor
 * @param buffer[in] - Caller supplied buffer to use as tensor's storage
 * @param size[in] - Buffer Size
 *
 * @return NRT_STATUS_SUCCESS on success.
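 *
 * Example (a minimal sketch; the malloc'd 1 MiB buffer is hypothetical and
 * must outlive the tensor; error handling is omitted):
 *
 *   nrt_tensor_t *t = NULL;
 *   void *buf = malloc(1024 * 1024);
 *   nrt_tensor_allocate_empty("scratch", &t);
 *   nrt_tensor_attach_buffer(t, buf, 1024 * 1024);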
 */
NRT_STATUS nrt_tensor_attach_buffer(nrt_tensor_t *tensor, void *buffer, size_t size);

/** Creates a tensor to point to a slice of another tensor
 * does not do a deep copy, just points the "slice" tensor storage to the "source" tensor storage
 *
 * @param tensor_source[in] - Tensor to point at
 * @param offset[in] - Offset from the beginning of the source tensor to point at
 * @param size[in] - Size of the slice
 * @param name[in] - Optional name for the new tensor
 * @param tensor_slice[out] - Newly allocated tensor to point to the storage of the source tensor
 *
 */
NRT_STATUS nrt_tensor_allocate_slice(const nrt_tensor_t *tensor_source, size_t offset, size_t size, const char *name,
                                     nrt_tensor_t **tensor_slice);

/** Given a tensor get the virtual address.
 *
 * @param tensor[in] - Tensor for which the VA needs to be obtained
 *
 * @return va on success, NULL on failure.
 */
void *nrt_tensor_get_va(const nrt_tensor_t *tensor);

/** Returns on device allocation info for a tensor
 *
 * @param tensor[in] - Tensor for which the information needs to be obtained
 * @param alloc_info[out] - On device allocation information
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
typedef struct nrt_tensor_device_allocation_info {
    uint64_t physical_address;  // physical address in device memory space
    size_t size;                // allocation size, could be larger than the tensor size
    int hbm_index;              // which of the HBMs the tensor is placed in
} nrt_tensor_device_allocation_info_t;
NRT_STATUS nrt_tensor_get_device_allocation_info(const nrt_tensor_t *tensor,
                                                 nrt_tensor_device_allocation_info_t *alloc_info);

/**
 * @brief A Runtime API to check if a given output tensor is fully written/complete.
 * If timeout is given as unbounded, it emits a warning after the first 30 seconds.
 *
 * @param output_tensor: The given output tensor.
 * @param timeout: The maximum total duration to wait for tensor completion in microseconds.
 *                 If timeout is negative, the wait is unbounded and the caller is in charge of handling
 *                 the timeout behaviors. Otherwise, it checks completion until the timeout.
 * @param expected_completion_count: The number of completions expected by the caller.
 *
 * @return NRT_STATUS: It returns NRT_SUCCESS if the tensor is complete;
 *                     It returns NRT_INVALID if the output tensor is given as NULL;
 *                     It returns NRT_TIMEOUT if the tensor does not reach the expected_completion_count within the timeout.
 */
NRT_STATUS nrt_tensor_check_output_completion(const nrt_tensor_t *output_tensor, int64_t timeout,
                                              uint64_t expected_completion_count);

/**
 * @brief A Runtime API to reset the completion counter inside an output tensor to 0.
 *
 * @param output_tensor: The given output tensor.
 * @return NRT_STATUS: It returns NRT_SUCCESS if reset is successful;
 *                     It returns NRT_INVALID if the output tensor is given as NULL.
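 *
 * Example (a minimal sketch; waits up to 1 second for a single completion on a
 * hypothetical output tensor "out", then rearms the counter):
 *
 *   if (nrt_tensor_check_output_completion(out, 1000000, 1) == NRT_SUCCESS) {
 *       nrt_tensor_reset_output_completion(out);
 *   }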
 */
NRT_STATUS nrt_tensor_reset_output_completion(nrt_tensor_t *output_tensor);

/**
 * @brief Get the anonymous file-descriptor of dma-buf associated with
 * a Neuron device memory region if it was registered for EFA peer direct
 *
 * @param va[in] - Device buffer virtual address
 * @param size[in] - Device buffer size (in bytes)
 * @param fd[out] - dma-buf fd
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrt_get_dmabuf_fd(uint64_t va, uint64_t size, int* fd);

/** Get the host-based device id from the device id presented to runtime (which may be a container-based device id)
 * @param neuron_dev[in] - device id
 * @param host_device_id[out] - host device id
 * @return NRT_SUCCESS if call was successful, NRT_INVALID otherwise
 */
NRT_STATUS nrt_host_device_id_get(int neuron_dev, uint32_t *host_device_id);

/** Return array of routing IDs indexed by host device ID. This is the definitive routing ID mapping provided by the driver
 * @param count[in/out] - [in] number of entries in the mapping table provided. [out] count of entries returned
 * @param host_did_to_rid_map[in] - table/map of routing IDs indexed by host device ID
 * @return NRT_SUCCESS if call was successful, NRT_INVALID otherwise
 */
NRT_STATUS nrt_host_device_id_rid_map_get(uint32_t *count, uint32_t *host_did_to_rid_map);

/**
 * Get the HBM virtual address and size for a specific HBM index.
 * @param device_id[in] - Device ID
 * @param hbm_idx[in] - HBM index
 * @param addr[out] - Pointer to store the virtual address
 * @param size[out] - Pointer to store the size of the HBM region
 * @return NRT_SUCCESS if call was successful and HBM region was mapped
 *         NRT_INVALID_HANDLE if there are no more HBM regions to map for this device
 *         NRT_INVALID if the interface isn't supported or for invalid parameters
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_get_hbm_mmap_va(int device_id, int hbm_idx, void **addr, size_t *size);

typedef struct nrt_vnc_memory_stats {
    size_t bytes_used;
    size_t bytes_limit;
    // NOTE: For backward compatibility, when making updates, don't delete existing fields, and
    // ALWAYS add to the end of this struct!
} nrt_vnc_memory_stats_t;

/** Get the NRT memory stats for a VNC.
 *
 * @param vnc[in] - Local VNC (within the instance)
 * @param stats[out] - Pointer to a nrt_vnc_memory_stats struct
 * @param stats_size_in[in] - Caller expected size of the nrt_vnc_memory_stats struct, for compatibility purposes
 * @param stats_size_out[out] - Library written size of the nrt_vnc_memory_stats struct, for compatibility purposes
 *
 * @return NRT_STATUS_SUCCESS on success.
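 *
 * Example (a minimal sketch; queries stats for VNC 0 and relies on the
 * size in/out arguments for forward/backward compatibility):
 *
 *   nrt_vnc_memory_stats_t stats;
 *   size_t out_size = 0;
 *   nrt_get_vnc_memory_stats(0, &stats, sizeof(stats), &out_size);
 *   // stats.bytes_used and stats.bytes_limit are valid if the call succeeded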
 */
NRT_STATUS nrt_get_vnc_memory_stats(uint32_t vnc, nrt_vnc_memory_stats_t *stats, size_t stats_size_in,
                                    size_t *stats_size_out);

/** Get BDF of the EFA device attached to a Neuron device identified by VA of HBM allocation on that device
 *
 * @param va[in] - VA of memory allocated on a Neuron device
 * @param efa_bdf[out] - a buffer (of sufficient size) to store BDF of the connected EFA device
 * @param len[in/out] - in: length of buffer (including NULL), out: length of string (excluding NULL)
 *
 * @return NRT_SUCCESS on success
 *         NRT_RESOURCE if the buffer is not large enough to store the BDF string
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_get_attached_efa_bdf(const void *va, char *efa_bdf, size_t *len);

/******************************
 *  Out-of-NEFF collectives   *
 ******************************/
typedef struct nrt_cc_comm {
    uint32_t *replica_group;    /* a list of participants */
    uint32_t rank;              /* my rank in the replica_group */
    uint32_t rank_n;            /* size of replica_group */
    uint32_t ctx_device_id;
    uint32_t ctx_device_count;
    uint32_t vnc;
} nrt_cc_comm_t;

typedef struct nrt_tensor_list {
    nrt_tensor_t **tensors;
    size_t num_tensors;
} nrt_tensor_list_t;

/** Build (initialize and setup) global communicator for host-driven collective operations.
 *
 * @param vnc[in] - Local VNC (within the instance)
 * @param g_device_id[in] - Global device id
 * @param g_device_count[in] - Max world size of all participating workers
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_cc_global_comm_init(uint32_t vnc, uint32_t g_device_id, uint32_t g_device_count);

#ifdef __cplusplus
}
#endif


================================================
FILE: src/libnrt/include/nrt/nrt_async.h
================================================
/*
 * Copyright 2025, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */

#pragma once

#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

// Use quoted includes in nrt headers including other nrt headers. Most clients
// (ptxla, jax, etc.) build with bazel, and bazel has issue with angle-brackets.
// See https://bazel.build/docs/bazel-and-cpp#include-paths for details.
#include "nrt/nrt.h"

#ifdef __cplusplus
extern "C" {
#endif

// execution units
typedef enum {
    NRTA_XU_TENSOR_READ = 0,
    NRTA_XU_TENSOR_WRITE,
    NRTA_XU_TENSOR_OP,      // For tensor ops other than read and write
    NRTA_XU_COMPUTE,
    NRTA_XU_COLLECTIVES,
    // For new XU types, must only add after existing ones
    NRTA_XU_TYPE_NUM
} nrta_xu_t;

// nrta_seq_t's are monotonically increasing ids of executions.
// The first 16 bits are an Execution Unit ID, while the last
// 48 bits are a strictly ordered Sequence Number
typedef uint64_t nrta_seq_t;
typedef uint16_t nrta_xu_id_t;
#define NRTA_SEQ_NUM_MAX ((1ull << 48) - 1)
#define NRTA_SEQ_NUM_MASK NRTA_SEQ_NUM_MAX
#define NRTA_SEQ_GET_SEQ_NUM(seq_id) (seq_id & NRTA_SEQ_NUM_MASK)
#define NRTA_SEQ_GET_XU_ID(seq_id) (seq_id >> 48)

typedef struct nrta_error {
    nrta_seq_t seq_id;
    uint64_t error_code;    // NRT_STATUS, but typed as uint64 to ensure consistent representation across compilers
} nrta_error_t;
static_assert(sizeof(nrta_error_t) == 16, "nrta_error_t must be of size 16");

// data structure used to store errors encountered during execution
typedef struct nrta_error_tracker nrta_error_tracker_t;

/** Enqueues a tensor write request. Copies the data from a host buffer to a
 * tensor allocated on a Neuron device. Uses TENSOR_WRITE execution unit based
 * on the LNC that allocated the tensor.
 *
 * @param tensor[in] - Destination tensor
 * @param buf[in] - Host buffer containing source data
 * @param offset[in] - Offset into the tensor
 * @param size[in] - Number of bytes to write
 * @param queue[in] - XU queue to use
 * @param err[in] - error tracker
 * @param req_sequence[out] - Sequence number of the scheduled request
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_tensor_write(nrt_tensor_t *tensor, const void *buf, uint64_t offset, uint64_t size, int queue,
                             nrta_error_tracker_t *err, nrta_seq_t *req_sequence);

/** Enqueues a tensor read request. Copies the data from a tensor allocated on a Neuron device
 * to a host buffer. Uses TENSOR_READ execution unit based
 * on the LNC that allocated the tensor.
 *
 * @param buf[in] - Destination Host buffer
 * @param tensor[in] - Source tensor
 * @param offset[in] - Offset into the tensor
 * @param size[in] - Number of bytes to read
 * @param queue[in] - XU queue to use
 * @param err[in] - error tracker
 * @param req_sequence[out] - Sequence number of the scheduled request
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_tensor_read(void *buf, nrt_tensor_t *tensor, uint64_t offset, uint64_t size, int queue,
                            nrta_error_tracker_t *err, nrta_seq_t *req_sequence);

/** Enqueues a tensor copy request. Copies data between two tensors allocated
 * on the same Logical Neuron Core. Uses TENSOR_OP execution unit.
 *
 * NOTE: the tensors must remain allocated until the copy completes
 *
 * @param src[in] - Source tensor
 * @param src_offset[in] - Offset into the source tensor
 * @param dst[in] - Destination tensor
 * @param dst_offset[in] - Offset into the destination tensor
 * @param size[in] - Number of bytes to copy
 * @param queue[in] - XU queue to use
 * @param err[in] - error tracker
 * @param req_sequence[out] - Sequence number of the scheduled request
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_tensor_copy(nrt_tensor_t *src, uint64_t src_offset, nrt_tensor_t *dst, uint64_t dst_offset,
                            uint64_t size, int queue, nrta_error_tracker_t *err, nrta_seq_t *req_sequence);

/** Schedules an asynchronous request to execute a model with specified inputs
 * and outputs. Uses COMPUTE execution unit of an LNC of the loaded model.
 *
 * @param model[in] - The model to schedule for execution
 * @param input[in] - Set of input tensors for the model
 * @param output[in] - Set of tensors to receive the outputs
 * @param queue[in] - XU queue to use, must be 0
 * @param err[in] - error tracker
 * @param req_sequence[out] - Sequence number of the scheduled request
 *
 * @return NRT_SUCCESS on successful preparation, appropriate error code otherwise
 */
NRT_STATUS nrta_execute_schedule(nrt_model_t *model, const nrt_tensor_set_t *input, nrt_tensor_set_t *output,
                                 int queue, nrta_error_tracker_t *err, nrta_seq_t *req_sequence);

/** Prepares collective context and HW configuration needed for collectives operation.
 * Allocates a collective context handle that is returned to the caller
 * which is freed in the schedule thread post CC op execution.
 *
 * @param comm[in] - Communicator containing the replica group
 * @param input[in] - Input tensor list
 * @param output[out] - Output tensor list
 * @param dtype[in] - Data type of elements
 * @param op[in] - Reduction operation (e.g., SUM, MAX) if applicable
 * @param cc_op[in] - Collective operation (e.g., ALLREDUCE, ALLGATHER)
 * @param cc_ctx[out] - Collective context
 *
 * @return NRT_SUCCESS on successful preparation, appropriate error code otherwise
 */
NRT_STATUS nrta_cc_prepare(nrt_cc_comm_t *comm, nrt_tensor_list_t *input, nrt_tensor_list_t *output,
                           nrt_dtype_t dtype, nrt_op_type_t op, nrt_cc_op_type_t cc_op, nrt_cc_context_t **cc_ctx);

/** Schedules an asynchronous request to execute collective operation
 *
 * @param cc_ctx[in] - Collective context
 * @param queue[in] - XU queue to use, must be 0
 * @param err[in] - error tracker
 * @param req_sequence[out] - Sequence number of the scheduled request
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrta_cc_schedule(nrt_cc_context_t **cc_ctx, int queue, nrta_error_tracker_t *err, nrta_seq_t *req_sequence);

// completion status

/** Checks completion status of a scheduled request
 *
 * @param seq[in] - Scheduled request sequence id
 * @param is_completed[out] - true if the request is completed, false otherwise
 *
 * @return NRT_SUCCESS if the request is completed, NRT_INVALID if the seq is not valid
 */
NRT_STATUS nrta_is_completed(nrta_seq_t seq, bool *is_completed);

/** Returns sequence number of the last completed request
 *
 * @param lnc[in] - LNC
 * @param xu[in] - XU
 * @param queue[in] - XU's queue
 * @param seq[out] - last completed sequence number
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_get_sequence(uint32_t lnc, nrta_xu_t xu, int queue, nrta_seq_t *seq);

/** Returns a pollable file descriptor that is READABLE when the execution request
 * specified by seq is complete.
 *
 * Note that users should only use the `poll` family of functions and `close` on this file
 * descriptor. Any other FD function is invalid and can lead to undefined behavior.
 *
 * The file descriptor must be passed to `close` to free the handle once the handle is not
 * needed anymore.
 *
 * @param seq[in] - sequence to track completion
 * @param fd[out] - FD associated with the sequence.
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_get_completion_handle(nrta_seq_t seq, int *fd);

/** Creates an error tracker list
 *
 * @param lnc_idx[in] - Logical Neuron Core this list will be used for
 * @param error_tracker[out] - Created list.
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_error_tracker_create(uint32_t lnc_idx, nrta_error_tracker_t **error_tracker);

/** Frees an error tracker list
 *
 * @param error_tracker[in] - Error tracker list to free
 *
 */
void nrta_error_tracker_destroy(nrta_error_tracker_t *error_tracker);

/** Gets list of errors from error tracker list
 *
 * @param error_tracker[in] - Error tracker list to get errors from
 * @param list[out] - Array of errors obtained from the error tracker
 * @param error_count[out] - Number of errors in the list
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrta_error_tracker_get_list(nrta_error_tracker_t *error_tracker, const nrta_error_t **list,
                                       size_t *error_count);

#ifdef __cplusplus
}
#endif


================================================
FILE: src/libnrt/include/nrt/nrt_async_sendrecv.h
================================================
#pragma once

#include "nrt/nrt.h"
#include "nrt/nrt_status.h"

#ifdef __cplusplus
extern "C" {
#endif

typedef struct nrt_async_sendrecv_comm nrt_async_sendrecv_comm_t;
typedef struct nrt_async_sendrecv_request nrt_async_sendrecv_request_t;

/**
 * Get the maximum number of async sendrecv communicators per logical neuron core
 *
 * @param num[out] - The maximum number of async sendrecv communicators per logical neuron core
 * @return NRT_SUCCESS on success
 *         NRT_FAILURE for errors
 */
NRT_STATUS nrt_async_sendrecv_get_max_num_communicators_per_lnc(int* num);

/**
 * Get the maximum number of pending requests per async sendrecv communicator
 *
 * @param num[out] - The maximum number of pending requests per async sendrecv communicator
 * @return NRT_SUCCESS on success
 *         NRT_FAILURE for errors
 */
NRT_STATUS nrt_async_sendrecv_get_max_num_pending_request(int* num);

/** Initialize asynchronous tensor send and receive on logical neuron core
 *
 * Logical neuron core ID is the absolute ID of the logical core on
 * the host machine. The ID is unaffected by device remapping via
 * docker and selection of visible logical cores.
 *
 * This function may only be called when runtime is initialized. This
 * function must have a matching call to nrt_async_sendrecv_close() before
 * nrt_close() is called.
 * This function returns an error in case a preceding call to
 * nrt_async_sendrecv_close() on the logical neuron core returned an error.
 *
 * @param lnc[in] - Logical neuron core ID on the current server
 * @return NRT_SUCCESS if logical core has been initialized successfully
 *         NRT_FAILURE for errors
 */
NRT_STATUS nrt_async_sendrecv_init(int lnc);

/** Closes asynchronous tensor send and receive of logical neuron core and cleans up resources
 *
 * A call to this function must have a preceding matching call to
 * nrt_async_sendrecv_init(). After this function was invoked, all sendrecv
 * communicators and requests associated with this logical neuron core
 * are closed and cannot be accessed anymore; invoking functions with those
 * communicators or requests is undefined behavior.
 * Cases where this function is called and one of the communicators is
 * not connected yet are considered an error. Cases where this
 * function is called and send or receive requests are still inflight
 * are considered an error.
 *
 * @param lnc[in] - Logical neuron core ID on the current server
 * @return NRT_SUCCESS if logical core has been closed successfully
 *         NRT_FAILURE for errors
 */
NRT_STATUS nrt_async_sendrecv_close(int lnc);

/** Create send communicator
 *
 * Before the send communicator can be used to initiate sending a tensor,
 * a connection to the receive communicator must be established.
 * Use function nrt_async_sendrecv_test_comm() to test whether the connection is
 * established.
 * Async sendrecv for logical neuron core lnc must have been
 * initialized via a call to nrt_async_sendrecv_init() before this function is
 * invoked.
 * This function is thread-safe.
 *
 * @param peer_ip[in] - IP address of peer logical neuron core
 * @param peer_lnc[in] - Logical neuron core ID on the peer server
 * @param lnc[in] - Logical neuron core ID on the current server
 * @param send_comm[out] - Pointer to send communicator
 * @return NRT_SUCCESS if the communicator has been created successfully
 *         NRT_RESOURCE if the number of created communicators exceeds the limit of NRT_ASYNC_SENDRECV_MAX_NUM_COMMUNICATORS_PER_LNC
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_connect(const char* peer_ip, int peer_lnc, int lnc, nrt_async_sendrecv_comm_t** send_comm);

/** Create receive communicator
 *
 * Before the receive communicator can be used to initiate receiving a tensor,
 * a connection to the peer send communicator must be established. Use
 * function nrt_async_sendrecv_test_comm() to test whether the connection is
 * established.
 * Async sendrecv for logical neuron core lnc must have been
 * initialized via a call to nrt_async_sendrecv_init() before this function is
 * invoked.
 * This function is thread-safe.
 *
 * @param peer_ip[in] - IP address of peer logical neuron core
 * @param peer_lnc[in] - Logical neuron core ID on the peer server
 * @param lnc[in] - Logical neuron core ID on the current server
 * @param recv_comm[out] - Pointer to receive communicator
 * @return NRT_SUCCESS if the communicator has been created successfully
 *         NRT_RESOURCE if the number of created communicators exceeds the limit of NRT_ASYNC_SENDRECV_MAX_NUM_COMMUNICATORS_PER_LNC
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_accept(const char* peer_ip, int peer_lnc, int lnc, nrt_async_sendrecv_comm_t** recv_comm);

/** Test whether connection has been established
 *
 * @param comm[in] - The send or receive communicator
 * @param done[out] - True if connection to peer communicator is established
 * @return NRT_SUCCESS if test performed without error
 *         NRT_INVALID_HANDLE if handle is invalid
 *         NRT_TIMEOUT if the communicator fails to establish connection within time limit
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_test_comm(nrt_async_sendrecv_comm_t* comm, bool* done);

/** Asynchronously send a tensor
 *
 * This is a non-blocking function.
 *
 * This function is thread-safe. This function is only allowed to be
 * invoked on a communicator that is successfully tested to be
 * connected via a call to nrt_async_sendrecv_test_comm().
 *
 * @param tensor[in] - Tensor to send from
 * @param offset[in] - Offset into the tensor to send from
 * @param length[in] - Number of bytes to send
 * @param send_comm[in] - Send communicator
 * @param request[out] - Pointer to send request
 * @return NRT_SUCCESS on success
 *         NRT_INVALID_HANDLE if handle is invalid
 *         NRT_RESOURCE if the number of pending requests exceeds the limit of NRT_ASYNC_SENDRECV_MAX_NUM_PENDING_REQUEST
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_send_tensor(nrt_tensor_t* tensor, size_t offset, size_t length,
                                          nrt_async_sendrecv_comm_t* send_comm,
                                          nrt_async_sendrecv_request_t** request);

/** Asynchronously receive a tensor
 *
 * This is a non-blocking function.
 *
 * This function is thread-safe. This function is only allowed to be
 * invoked on a communicator that is successfully tested to be
 * connected via a call to nrt_async_sendrecv_test_comm().
 *
 * @param tensor[in] - Tensor to receive to
 * @param offset[in] - Offset into the tensor to receive to
 * @param length[in] - Number of bytes to read
 * @param recv_comm[in] - Receive communicator
 * @param request[out] - Pointer to receive request
 * @return NRT_SUCCESS on success
 *         NRT_INVALID_HANDLE if handle is invalid
 *         NRT_RESOURCE if the number of pending requests exceeds the limit of NRT_ASYNC_SENDRECV_MAX_NUM_PENDING_REQUEST
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_recv_tensor(nrt_tensor_t* tensor, size_t offset, size_t length,
                                          nrt_async_sendrecv_comm_t* recv_comm,
                                          nrt_async_sendrecv_request_t** request);

/** Test the completion status of an asynchronous request
 *
 * This function is thread-safe when invoked with different
 * requests. This function is not allowed to be invoked concurrently
 * by multiple threads with the same request at the same time. Once
 * this function has reported a request as completed, it must not be
 * invoked again with the same request.
 *
 * @param request[in] - Request to test
 * @param done[out] - Whether the request has completed
 * @param size[out] - Number of bytes sent/received
 * @return NRT_SUCCESS on success
 *         NRT_INVALID_HANDLE if handle is invalid
 *         NRT_TIMEOUT if the request fails to complete data transfer within time limit
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_test_request(nrt_async_sendrecv_request_t* request, bool* done, size_t* size);

/** Flush received messages to ensure full arrival in memory
 *
 * Ensures that received messages of successfully tested async sendrecv
 * receive operations prior to the call to this function have fully arrived in
 * memory after this function completes.
 *
 * @param lnc[in] - Receiving logical neuron core ID
 * @return NRT_SUCCESS if flush operation succeeded
 *         NRT_FAILURE for other errors
 */
NRT_STATUS nrt_async_sendrecv_flush(int lnc);

#ifdef __cplusplus
}
#endif


================================================
FILE: src/libnrt/include/nrt/nrt_experimental.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */

#pragma once

#include <stddef.h>
#include <stdint.h>

#include "nrt/nrt_status.h"
#include "nrt/nrt.h"

#ifdef __cplusplus
extern "C" {
#endif

/** Usage of a Tensor in the NEFF */
typedef enum nrt_tensor_usage {
    NRT_TENSOR_USAGE_INPUT = 0, // Tensor is used for ifmap
    NRT_TENSOR_USAGE_OUTPUT,    // Tensor is used for ofmap
} nrt_tensor_usage_t;

#define NRT_TENSOR_NAME_MAX 256

typedef struct nrt_tensor_info {
    char name[NRT_TENSOR_NAME_MAX]; // Name of the tensor
    nrt_tensor_usage_t usage;       // Type of the tensor
    size_t size;                    // Tensor size in bytes
    nrt_dtype_t dtype;              // data type
    uint32_t *shape;                // an array representing data shape
    uint32_t ndim;                  // the number of dimensions
} nrt_tensor_info_t;

typedef struct nrt_tensor_info_array {
    uint64_t tensor_count;              // Total number of tensors in the NEFF
    nrt_tensor_info_t tensor_array[];   // Array of tensor info
} nrt_tensor_info_array_t;

/* Function definition for async exec status callbacks */
typedef void (*NRT_ASYNC_EXEC_STATUS_CALLBACK)(void *params, uint32_t model_id, uint32_t vnc, uint64_t job_id,
                                               NRT_STATUS status);

/** Return input/output tensor information for a given model.
 *
 * @param model[in] - Model for which tensor information needs to be extracted.
 * @param tensor_info[out] - Pointer to store the result.
 *
 * @return NRT_STATUS_SUCCESS on success.
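 *
 * Example (a minimal sketch; assumes "model" was loaded with nrt_load() and
 * error handling is omitted):
 *
 *   nrt_tensor_info_array_t *info = NULL;
 *   nrt_get_model_tensor_info(model, &info);
 *   for (uint64_t i = 0; i < info->tensor_count; i++) {
 *       nrt_tensor_info_t *t = &info->tensor_array[i];
 *       // t->name, t->usage (input/output), t->size, t->dtype, t->shape, t->ndim
 *   }
 *   nrt_free_model_tensor_info(info);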
 */
NRT_STATUS nrt_get_model_tensor_info(nrt_model_t *model, nrt_tensor_info_array_t **tensor_info);

/** Return the instance count for this model handle (optimal number of concurrent threads that can call nrt_execute). (deprecated)
 *
 * @param model[in] - Model for which the instance count needs to be returned.
 * @param instance_count[out] - Pointer to store the result.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_get_model_instance_count(nrt_model_t *model, uint32_t *instance_count);

/** Free input/output tensor information for a given model.
 *
 * @param tensor_info[in] - Tensor information to free.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_free_model_tensor_info(nrt_tensor_info_array_t *tensor_info);

/** Enable tracing for all VNCs visible to the app
 *
 * @param trace_mem[in] - collect memory allocation info
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_trace_start(bool trace_mem);

/** Serialize all data and disable tracing
 *
 * @param filename[in] - filename to write to
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_trace_stop(const char *filename);

/** temporary, to be removed. See comment in neuron_nccl.cc */
void *nrt_get_libnccl_net(int *err, char *err_msg, size_t err_msg_size);

/** Structs to pass around ucode image info */
typedef struct nrt_ucode_img {
    uint8_t *bin;
    size_t size;
} nrt_ucode_img;

typedef struct nrt_ucode_info {
    nrt_ucode_img iram;
    nrt_ucode_img dram;
} nrt_ucode_info;

/** Specify pooling engine ucode iram and dram images that will get loaded by nrt_init().
 * To use this API, it MUST be called BEFORE nrt_init().
 * Swapping ucode after nrt_init() is NOT supported. Ucode images are only loaded once.
 * This API provides a temporary workaround for swapping ucode.
 */
NRT_STATUS nrt_set_pool_eng_ucode(const nrt_ucode_info *ucode_info);

/** Copies data to memory mapped Neuron device memory
 *
 * @param dest[in] - Pointer to destination memory (mmaped device memory)
 * @param src[in] - Pointer to source memory
 * @param size[in] - Copy size
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_memcpy_to_device(void *dest, const void *src, size_t size);

/** Register a return status callback to post exec status to when running in async exec mode.
 * Calling this multiple times will replace the previously registered callback.
 *
 * @param callback[in] - Callback to post nrt exec status to for async execution.
 * @param params[in] - Params for the async exec thread to pass to the callback upon
 *                     execution completion. Can be NULL.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_register_async_exec_callback(NRT_ASYNC_EXEC_STATUS_CALLBACK callback, void *params);

/** Implements a barrier by running a small all-reduce over all workers
 *
 * @param vnc[in] - local VNC (within the instance)
 * @param g_device_id[in] - global worker ID
 * @param g_device_count[in] - total number of workers
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_barrier(int32_t vnc, uint32_t g_device_id, uint32_t g_device_count);

/** Perform all-rank AllGather
 *
 * @param vnc[in] - local VNC (within the instance)
 * @param g_device_id[in] - global worker ID
 * @param g_device_count[in] - total number of workers
 * @param rank_input_size[in] - input size
 * @param input[in] - ptr to input data from this rank
 * @param output[out] - ptr to output buffer of size (g_device_count*rank_input_size)
 *
 * @return NRT_STATUS_SUCCESS on success.
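 *
 * Example (a minimal sketch for a hypothetical 4-worker job where "rank" is
 * this worker's global ID; error handling is omitted):
 *
 *   uint32_t value = rank;
 *   uint32_t gathered[4];
 *   nrt_all_gather(0, rank, 4, sizeof(uint32_t), &value, gathered);
 *   // gathered[i] now holds worker i's value on every rank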
 */
NRT_STATUS nrt_all_gather(int32_t vnc, uint32_t g_device_id, uint32_t g_device_count, uint32_t rank_input_size,
                          void *input, void *output);

/** Blocks caller until all queued executions on async worker thread are drained.
 *
 * @param vnc - VNC index to block on.
 *
 * @return NRT_STATUS_SUCCESS on success.
 */
NRT_STATUS nrt_async_drain_queued_execs(int32_t vnc);

typedef struct nrt_model_info {
    uint32_t vnc;
    // additional fields can be added here in the future
    // do not remove previously added fields because it will cause
    // memory corruption if the caller was compiled using a different
    // version of this header.
} nrt_model_info_t;

/** Returns information about loaded model
 *
 * @param model [in] - the model
 * @param info [out] - the information about the model
 * @param info_size_in [in] - the size of the info structure (used for version control)
 * @param info_size_out [out] - the number of bytes written (for version control)
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrt_get_model_info(const nrt_model_t *model, nrt_model_info_t *info, size_t info_size_in,
                              size_t *info_size_out);

#ifdef __cplusplus
}
#endif


================================================
FILE: src/libnrt/include/nrt/nrt_profile.h
================================================
/*
 * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved
 */

#pragma once

#include "nrt/nrt.h"

#ifdef __cplusplus
extern "C" {
#endif

/** Enable profiling for a model
 *
 * @param model[in] - model to profile
 * @param filename[in] - output filename that will be used with nrt_profile_stop()
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_start(nrt_model_t *model, const char *filename);

/** Collect results and disable profiling for a model
 *
 * @param filename[in] - output filename to save the NTFF profile to
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_stop(const char *filename);

/** Options for continuous device profiling.
 *
 * Opaque struct used to preserve compatibility and enforce proper usage.
 * Use nrt_profile_continuous_options_set_* functions to set options.
 * Default options:
 *   - output_dir: "./output"
 *
 * Usage:
 *   nrt_profile_continuous_options_t *options;
 *   nrt_profile_continuous_options_allocate(&options);
 *   nrt_profile_continuous_options_set_output_dir(options, "./output");
 */
typedef struct nrt_profile_continuous_options nrt_profile_continuous_options_t;

/** Allocate memory for the nrt_profile_continuous_options_t struct and set all options to defaults.
 *
 * @param options[out] - pointer to a pointer to nrt_profile_continuous_options_t struct
 */
NRT_STATUS nrt_profile_continuous_options_allocate(nrt_profile_continuous_options_t **options);

/** Free up memory allocated for the options struct needed for continuous device profiling.
 *
 * @param options[in] - pointer to a nrt_profile_continuous_options struct
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_continuous_options_free(nrt_profile_continuous_options_t *options);

/** Sets the output directory for results of continuous device profiling.
 *
 * The filename is set automatically.
 *
 * @param[in,out] options Pointer to the options struct.
 * @param[in] output_dir Path to the output directory.
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_continuous_options_set_output_dir(nrt_profile_continuous_options_t *options,
                                                         const char *output_dir);

/** @brief Start continuous device profiling.
 *
 * When continuous device profiling is started, profiling is enabled for every model, but notifications
 * will only be serialized to disk when the user calls nrt_profile_continuous_save(). This gives
 * the user control over which profiles are saved to disk. When a profile is not saved, the overhead
 * of trace serialization and disk write is avoided. Continuous profiling is ideal for scenarios where you
 * only want to save profiles for specific executions. In this mode you do not need to call
 * nrt_profile_start() and nrt_profile_stop() because they are called internally. Continuous profiling
 * will not start if inspect device profiling is already enabled or async execution is enabled.
 *
 * @param options[in] - options to control continuous device profiling
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_continuous_start(nrt_profile_continuous_options_t *options);

/** Save NTFF profile to disk for the latest model executed on requested NeuronCore.
 *
 * Output directory will be set according to the options passed into this function. The filenames of
 * NTFFs within the output directory are chosen automatically to avoid conflicts. Calling save does
 * not stop continuous profiling.
 *
 * @param vnc[in] - (start) NeuronCore id to collect profile for
 * @param options[in] - options to control continuous device profiling
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_continuous_save(uint32_t vnc, nrt_profile_continuous_options_t *options);

/** Stops continuous device profiling.
 *
 * Calling stop does not save a profile.
 *
 * @return NRT_SUCCESS on success.
 */
NRT_STATUS nrt_profile_continuous_stop();

/* Begin tracing/profiling
 *
 * Users of this API must set options through environment variables:
 *
 * - NEURON_RT_INSPECT_ENABLE: Set to 1 to enable system and device profiles.
 *   For control over which profile types are captured, use NEURON_RT_INSPECT_SYSTEM_PROFILE
 *   and NEURON_RT_INSPECT_DEVICE_PROFILE.
 * - NEURON_RT_INSPECT_OUTPUT_DIR: The directory where captured profile data will be saved to.
 *   Defaults to ./output.
 * - NEURON_RT_INSPECT_SYSTEM_PROFILE: Set to 0 to disable the capture of system profiles.
 *   Defaults to 1 when NEURON_RT_INSPECT_ENABLE is set to 1.
 * - NEURON_RT_INSPECT_DEVICE_PROFILE: Set to 0 to disable the capture of device profiles.
 *   Defaults to 1 when NEURON_RT_INSPECT_ENABLE is set to 1.
 * - NEURON_RT_INSPECT_ON_FAIL: Set to 1 to enable dumping of device profiles in case of an error
 *   during graph execution. Defaults to 0.
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrt_inspect_begin();

/* Stop tracing/profiling and dump profile data.
 * Does nothing if `duration` is given to nrt_inspect_begin() and already elapsed
 *
 * @return NRT_SUCCESS on success
 */
NRT_STATUS nrt_inspect_stop();

/** @brief Options for nrt_inspect_begin_with_options API.
 *
 * Opaque struct used to preserve compatibility and enforce proper usage.
 * Use nrt_inspect_config_set_* functions to set options or
 * nrt_inspect_config_set_defaults to use default options.
 *
 * Example Usage:
 *   nrt_inspect_config_t *options;
 *   nrt_inspect_config_allocate(&options);
 *   nrt_inspect_config_set_output_dir(options, "./output");
 */
typedef struct nrt_inspect_config nrt_inspect_config_t;

/** Allocate memory for the options structure which is needed to
 * start profiling using nrt_inspect_begin_with_options.
This will set all options to defaults. * * @param options[out] - pointer to a pointer to options nrt_inspect_config struct * */ NRT_STATUS nrt_inspect_config_allocate(nrt_inspect_config_t **options); /** @brief Sets all fields of the nrt_inspect_config structure to their default values. * * Default behavior after calling this function: * - Session ID: 1 * - Output directory: "./output" (when not explicitly set) * - Activity types: All activity types enabled (system_profile, device_profile, host_memory, cpu_util) * - System trace: All NeuronCores and event types enabled for capture * - Inspect mode: Disabled (profiles not captured automatically) * - Inspect on failure: Disabled (profiles not captured on execution failures) * * @param options[in,out] - Pointer to an nrt_inspect_config structure. * * @return NRT_SUCCESS on success * * @note The default values set here are NOT influenced by the environment variables. * If you are using the environment variables to set the values, you do not need to use this method * or any of the nrt_inspect_config_set_* functions. */ NRT_STATUS nrt_inspect_config_set_defaults(nrt_inspect_config_t *options); /** Free up memory allocated for the options structure which is needed to * start profiling using nrt_inspect_begin_with_options * * @param options[in] - pointer to an options nrt_inspect_config struct * @return NRT_SUCCESS on success */ NRT_STATUS nrt_inspect_config_free(nrt_inspect_config_t *options); /** * @brief Sets the session ID for the nrt_inspect_config_t which is needed to * start profiling using nrt_inspect_begin_with_options * * @param[in,out] options Pointer to the options structure. * @param[in] session_id Session ID to set. * @return NRT_SUCCESS on success */ NRT_STATUS nrt_inspect_config_set_session_id(nrt_inspect_config_t *options, int session_id); /** * @brief Sets the output directory for results of * profiling using nrt_inspect_begin_with_options * * @param[in,out] options Pointer to the options structure. * @param[in] output_dir Path to the output directory. Must be a valid non-empty string * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters, NRT_RESOURCE for memory allocation failure. * * @note The function makes an internal copy of the string, so the caller * does not need to keep the original string alive. * @note Call nrt_inspect_config_free() to properly clean up allocated memory. */ NRT_STATUS nrt_inspect_config_set_output_dir(nrt_inspect_config_t *options, const char *output_dir); /** * @brief Sets max number of system trace events that can be stored across all ring buffers * * @param[in,out] options Pointer to the options structure. * @param[in] sys_trace_max_events_per_nc Max number of system trace events that can be stored across all ring buffers. * @return NRT_SUCCESS on success */ NRT_STATUS nrt_inspect_config_set_sys_trace_max_events_per_nc(nrt_inspect_config_t *options, uint64_t sys_trace_max_events_per_nc); /** * @brief Sets system trace capture enabled for a specific NeuronCore. * Ring buffers won't be allocated for disabled NeuronCores. * * @param[in,out] options Pointer to the options structure. * @param[in] nc_idx Index of the NeuronCore. * @param[in] enabled Boolean value to enable or disable system trace capture.
* @return NRT_SUCCESS on success */ NRT_STATUS nrt_inspect_config_set_capture_enabled_for_nc(nrt_inspect_config_t *options, uint32_t nc_idx, bool enabled); /** * @brief Sets system trace capture enabled for a specific event type. * Disabling unneeded event types can save memory and reduce output size. * @param[in,out] options Pointer to the options structure. * @param[in] event_type Event type string. * @param[in] enabled Capture enabled flag. * @return NRT_SUCCESS on success * * @note Event type must be a string from the list of supported event types. To get the list of supported event types, * use nrt_sys_trace_get_event_types in the nrt_sys_trace.h header file. */ NRT_STATUS nrt_inspect_config_set_capture_enabled_for_event_type_string(nrt_inspect_config_t *options, const char *event_type, bool enabled); /** * @brief Enable both system and device profiling for normal execution * * When disabled (default), no profiles are captured during normal execution. * This flag controls whether profiles are captured automatically for each execution. * Note: If both enable_inspect and enable_inspect_on_fail are false, no profiling occurs. * * @param[in,out] options Pointer to the options structure. * @param[in] enable_inspect Boolean value to enable or disable inspect profiling. * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters. */ NRT_STATUS nrt_inspect_config_set_enable_inspect(nrt_inspect_config_t *options, bool enable_inspect); /** * @brief Enable dumping of device profiles in case of execution failures * * When enabled, device profiles will be captured and saved when graph execution fails. * This is disabled by default. If both enable_inspect and enable_inspect_on_fail are false, * no profiling occurs at all. * * @param[in,out] options Pointer to the options structure. * @param[in] enable_inspect_on_fail Boolean value to enable or disable inspect on failure. * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters. */ NRT_STATUS nrt_inspect_config_set_enable_inspect_on_fail(nrt_inspect_config_t *options, bool enable_inspect_on_fail); /** * Begin tracing/profiling with configurable options * * @param[in] options - A pointer to an nrt_inspect_config struct containing configuration options * for profiling. Use nrt_inspect_config_set_* functions to set options. * If NULL is passed, default options will be used. * @return NRT_SUCCESS on success * * @note This API ignores all the NEURON_RT_INSPECT_* environment variables. * If you are using the environment variables to set the values, you do not need to use this method * or any of the nrt_inspect_config_set_* functions. Use nrt_inspect_begin() instead. */ NRT_STATUS nrt_inspect_begin_with_options(nrt_inspect_config_t *options); /** * @brief Returns all available activity type strings * * This function allocates and returns an array of all supported activity type * strings. The caller is responsible for freeing both the individual strings * and the array itself, or can use nrt_inspect_config_free_activity_types(). * * @param[out] activity_types Pointer to store the allocated array of activity type strings. * @param[out] count Pointer to store the number of activity types returned. * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters, * NRT_RESOURCE for memory allocation failure.
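 *
 * A hypothetical caller-side sketch (not part of the original header; error
 * handling omitted for brevity):
 *
 *   const char **types = NULL;
 *   size_t count = 0;
 *   if (nrt_inspect_config_get_all_activity_types(&types, &count) == NRT_SUCCESS) {
 *       for (size_t i = 0; i < count; ++i) {
 *           printf("%s\n", types[i]);        // e.g. "system_profile", "device_profile", ...
 *       }
 *       nrt_inspect_config_free_activity_types(types, count);
 *   }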
*/ NRT_STATUS nrt_inspect_config_get_all_activity_types(const char ***activity_types, size_t *count); /** * @brief Returns the currently enabled activity type strings * * This function examines the enabled_activities bitmask in the configuration * and returns an array of strings for only the currently enabled activity types. * The caller is responsible for freeing both the individual strings and the array itself, * or can use nrt_inspect_config_free_activity_types(). * * @param[in] options Pointer to the options structure. * @param[out] activity_types Pointer to store the allocated array of enabled activity type strings. * @param[out] count Pointer to store the number of enabled activity types returned. * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters, * NRT_RESOURCE for memory allocation failure. */ NRT_STATUS nrt_inspect_config_get_enabled_activity_types(nrt_inspect_config_t *options, const char ***activity_types, size_t *count); /** * @brief Free the activity types array allocated by nrt_inspect_config_get_all_activity_types * or nrt_inspect_config_get_enabled_activity_types. * This function properly frees both the array and all individual strings. * * @param[in] activity_types Pointer to the activity types array to be freed. * @param[in] count Number of activity types in the array. */ void nrt_inspect_config_free_activity_types(const char **activity_types, size_t count); /** * @brief Sets or clears a specific activity type in the configuration * * This function enables or disables a specific activity type by name. It converts * the activity type string to the corresponding enum value and updates the * enabled_activities bitmask accordingly. * * @param[in,out] options Pointer to the options structure. * @param[in] activity_type String name of the activity type. Valid values are: * "system_profile", "device_profile", "host_memory", * "cpu_util", "all" * @param[in] enabled True to enable the activity, false to disable it. * @return NRT_SUCCESS on success, NRT_INVALID for invalid parameters or unknown activity type. */ NRT_STATUS nrt_inspect_config_set_activity(nrt_inspect_config_t *options, const char *activity_type, bool enabled); #ifdef __cplusplus } #endif ================================================ FILE: src/libnrt/include/nrt/nrt_status.h ================================================ /* * Copyright 2021, Amazon.com, Inc. or its affiliates. All Rights Reserved */ #pragma once #ifdef __cplusplus extern "C" { #endif // NOTE: if making changes here please also keep // KaenaTools: pkg/rt/rt.go in sync typedef enum { NRT_SUCCESS = 0, NRT_FAILURE = 1, // non-specific failure; don't use if there is a more descriptive type NRT_INVALID = 2, // e.g. invalid NEFF, bad instruction, bad DMA descriptor, input tensor name/size does not match the model, etc.
// TODO invalid_handle is no longer useful because handles are not passed in nrt API // remove NRT_INVALID_HANDLE = 3, // make this one explicit instead of using more generic INVALID_INPUT because it could be a common caller mistake NRT_RESOURCE = 4, // failed to allocate a resource for requested operation // TODO separate exec timeout from others NRT_TIMEOUT = 5, // operation timed out NRT_HW_ERROR = 6, // Hardware failure NRT_QUEUE_FULL = 7, // not enough space in the execution input queue NRT_LOAD_NOT_ENOUGH_NC = 9, // Failed to allocate enough NCs for loading a NEFF NRT_UNSUPPORTED_NEFF_VERSION = 10, // Unsupported version of NEFF // DO NOT USE - keep for backward compat NRT_FAIL_HOST_MEM_ALLOC = 11, // failed to allocate host memory // Unique retcodes to help the caller identify when nrt apis are called outside the scope of nrt_init() and nrt_close() NRT_UNINITIALIZED = 13, NRT_CLOSED = 14, NRT_QUEUE_EMPTY = 15, // Accessed a queue with no data NRT_EXEC_UNIT_UNRECOVERABLE = 101, // Encountered a fatal error and Execution Unit is in limbo, cannot recover. Need to reset NRT_EXEC_BAD_INPUT = 1002, // invalid input has been submitted to exec() NRT_EXEC_COMPLETED_WITH_NUM_ERR = 1003, // execution was completed with numerical errors (produced NaN) NRT_EXEC_COMPLETED_WITH_ERR = 1004, // execution was completed with other errors, // either logical - event double clear, or physical - parity error NRT_EXEC_NC_BUSY = 1005, // the neuron core is locked (in use) by another model/process NRT_EXEC_OOB = 1006, // one or more indirect memcopies and/or embedding updates are out of bound NRT_COLL_PENDING = 1100, // collective operation is still pending // classify different types of execution hangs/timeouts. For unknown/generic hang, use NRT_TIMEOUT. NRT_EXEC_HW_ERR_COLLECTIVES = 1200, // Stuck in collectives op (missing notification(s)). Possibly caused by a hardware error on another worker. NRT_EXEC_HW_ERR_HBM_UE = 1201, // An HBM encountered an unrepairable uncorrectable error and produced incorrect results. NRT_EXEC_HW_ERR_NC_UE = 1202, // An on-chip memory of a NeuronCore encountered a parity error and produced incorrect results. NRT_EXEC_HW_ERR_DMA_ABORT = 1203, // A DMA engine encountered an unrecoverable error. NRT_EXEC_SW_NQ_OVERFLOW = 1204, // Software notification queue overflow. NRT_EXEC_HW_ERR_REPAIRABLE_HBM_UE = 1205, // An HBM encountered a repairable uncorrectable error and produced incorrect results. NRT_NETWORK_PROXY_FAILURE = 1206, // EFA network proxy operation failed. } NRT_STATUS; const char *nrt_get_status_as_str(NRT_STATUS status); #ifdef __cplusplus } #endif ================================================ FILE: src/libnrt/include/nrt/nrt_sys_trace.h ================================================ /* * Copyright 2025, Amazon.com, Inc. or its affiliates. All Rights Reserved */ #pragma once #include #ifdef __cplusplus extern "C" { #endif /* * This is a public interface used by both the fetch api (which allows near * real-time querying of captured events), and inspect profiling (which saves * captured events to disk), as well as other profiling functions. */ //------------------------------------------------ // Section: System Trace Capture //------------------------------------------------ typedef struct nrt_sys_trace_config nrt_sys_trace_config_t; /** Allocate memory for the options structure which is needed to * start profiling using nrt_sys_trace_start. This will set all options to * defaults.
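 *
 * A hypothetical end-to-end capture sketch (caller-side code, not part of the
 * original header; the event count of 100000 is an arbitrary illustration, and
 * inference is assumed to run elsewhere in the process while capture is active):
 *
 *   nrt_sys_trace_config_t *cfg;
 *   nrt_sys_trace_config_allocate(&cfg);
 *   nrt_sys_trace_config_set_max_events_per_nc(cfg, 100000);
 *   nrt_sys_trace_start(cfg);
 *   // ... run inference; events accumulate in the per-NeuronCore ring buffers ...
 *   nrt_sys_trace_stop();
 *   nrt_sys_trace_config_free(cfg);
 *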
The reason we use an _allocate function is so that users don't need * to know the size or implementation details of the config struct. * * @param options[in] - pointer to a pointer to options nrt_sys_trace_config struct * */ NRT_STATUS nrt_sys_trace_config_allocate(nrt_sys_trace_config_t **options); /** Set all fields of the nrt_sys_trace_config structure to their default values. * * @param options[in,out] - Pointer to an nrt_sys_trace_config structure. */ void nrt_sys_trace_config_set_defaults(nrt_sys_trace_config_t *options); /** Free up memory allocated for the options structure which is needed to * start profiling using nrt_sys_trace_start * * @param options[in] - pointer to an options nrt_sys_trace_config struct * */ void nrt_sys_trace_config_free(nrt_sys_trace_config_t *options); /** * @brief Sets the max number of events that can be stored in each per-NeuronCore ring buffer * * @param[in,out] options Pointer to the options structure. * @param[in] max_events_per_nc Max number of events that can be stored in each ring buffer. */ void nrt_sys_trace_config_set_max_events_per_nc(nrt_sys_trace_config_t *options, uint64_t max_events_per_nc); /** * @brief Sets system trace capture enabled for a specific NeuronCore. * Ring buffers won't be allocated for disabled NeuronCores. * Can save memory, reduce output size, and speed up trace processing. * @param[in,out] options Pointer to the options structure. * @param[in] nc_idx NeuronCore index. * @param[in] enabled Capture enabled flag. */ void nrt_sys_trace_config_set_capture_enabled_for_nc(nrt_sys_trace_config_t *options, uint32_t nc_idx, bool enabled); /** * @brief Sets system trace capture enabled for a specific event type. * Can save memory, reduce output size, and speed up trace processing. * @param[in,out] options Pointer to the options structure. * @param[in] event_type Event type string; possible values come from nrt_sys_trace_get_event_types * @param[in] enabled Capture enabled flag. */ NRT_STATUS nrt_sys_trace_config_set_capture_enabled_for_event_type(nrt_sys_trace_config_t *options, const char *event_type, bool enabled); /** * @brief Returns an allocated array of all valid event type strings. * @param[out] event_types Pointer to array of const char* (allocated). * @param[out] count Number of event types. * @return NRT_SUCCESS on success, error code otherwise. * @note The user is responsible for freeing the array and each string, or can use * nrt_sys_trace_free_event_types() for convenience. * * Example usage: * const char **event_types = NULL; * size_t count = 0; * NRT_STATUS status = nrt_sys_trace_get_event_types(&event_types, &count); * // Manual cleanup: * for (size_t i = 0; i < count; ++i) { * free((void*)event_types[i]); * } * free((void*)event_types); * // Or use convenience function: * nrt_sys_trace_free_event_types(event_types, count); */ NRT_STATUS nrt_sys_trace_get_event_types(const char ***event_types, size_t *count); /** * @brief Free the event types array allocated by nrt_sys_trace_get_event_types. * This function properly frees both the array and all individual strings. * * @param[in] event_types Pointer to the event types array to be freed. * @param[in] count Number of event types in the array. */ void nrt_sys_trace_free_event_types(const char **event_types, size_t count); /** * @brief Returns an allocated array of enabled event type strings for the given config. * @param[in] options Pointer to the nrt_sys_trace_config_t structure. * @param[out] event_types Pointer to array of const char* (allocated).
* @param[out] count Number of enabled event types. * @return NRT_SUCCESS on success, error code otherwise. * @note The user is responsible for freeing the array and each string. */ NRT_STATUS nrt_sys_trace_config_get_enabled_event_types(nrt_sys_trace_config_t *options, const char ***event_types, size_t *count); // Initialization for system trace capture including allocating memory for event ring buffers NRT_STATUS nrt_sys_trace_start(nrt_sys_trace_config_t *options); // Teardown for system trace capture including freeing allocated memory for event ring buffers NRT_STATUS nrt_sys_trace_stop(); //------------------------------------------------ // Section: System Trace Fetch //------------------------------------------------ typedef struct nrt_sys_trace_fetch_options nrt_sys_trace_fetch_options_t; NRT_STATUS nrt_sys_trace_fetch_options_allocate(nrt_sys_trace_fetch_options_t **options); void nrt_sys_trace_fetch_options_set_defaults(nrt_sys_trace_fetch_options_t *options); void nrt_sys_trace_fetch_options_free(nrt_sys_trace_fetch_options_t *options); // Max number of events to fetch per NeuronCore void nrt_sys_trace_fetch_options_set_max_events_per_nc(nrt_sys_trace_fetch_options_t *options, uint64_t max_events_per_nc); // Fetch events only for specified NeuronCore void nrt_sys_trace_fetch_options_set_nc_idx(nrt_sys_trace_fetch_options_t *options, uint64_t nc_idx); /** * Fetches system trace events from process memory and returns them as a JSON-formatted string. * Once events are fetched, they cannot be fetched again. * * @param[out] buffer On successful return, will point to a dynamically allocated, null-terminated * JSON string containing the trace events. Memory for the output buffer is * allocated internally; therefore, the caller should not allocate the buffer * before calling this function. The caller should initialize the buffer * pointer to NULL and, after a successful call, is responsible for * freeing the allocated memory by calling nrt_sys_trace_buffer_free(buffer). * * @param[out] written_size A pointer to a size_t variable that will be set to the number of bytes written * into the allocated buffer. * * @param[in] options Pointer to options such as max number of events to fetch. * * @return NRT_SUCCESS on success. * * Usage example: * char *buffer; * size_t written_size; * nrt_sys_trace_fetch_options_t *options; * nrt_sys_trace_fetch_options_allocate(&options); * nrt_sys_trace_fetch_options_set_nc_idx(options, 0); // Fetch events from NeuronCore 0 only instead of all * nrt_sys_trace_fetch_options_set_max_events_per_nc(options, 10000); // Fetch up to 10,000 events instead of all * nrt_sys_trace_fetch_events(&buffer, &written_size, options); * // or if you want to use the default options: * nrt_sys_trace_fetch_events(&buffer, &written_size, NULL); * // finally free the buffer when the events are no longer needed: * nrt_sys_trace_buffer_free(buffer); */ NRT_STATUS nrt_sys_trace_fetch_events(char **buffer, size_t *written_size, const nrt_sys_trace_fetch_options_t *options); /** Free the buffer allocated by nrt_sys_trace_fetch_events. Should be called after the events are no longer needed. * * @param buffer [in] - Pointer to buffer to be freed. */ void nrt_sys_trace_buffer_free(char *buffer); #ifdef __cplusplus } #endif ================================================ FILE: src/libnrt/include/nrt/nrt_version.h ================================================ /* * Copyright 2021, Amazon.com, Inc. or its affiliates.
All Rights Reserved */ #pragma once #ifdef __cplusplus extern "C" { #endif #define RT_VERSION_DETAIL_LEN 128 #define GIT_HASH_LEN 64 typedef struct nrt_version { uint64_t rt_major; uint64_t rt_minor; uint64_t rt_patch; uint64_t rt_maintenance; char rt_detail[RT_VERSION_DETAIL_LEN]; char git_hash[GIT_HASH_LEN]; } nrt_version_t; /** Get the NRT library version * * @param ver[out] - Pointer to nrt version struct * @param size[in] - Length of the data needed to be filled in the nrt_version struct * * @return NRT_SUCCESS on success. */ NRT_STATUS nrt_get_version(nrt_version_t *ver, size_t size); #ifdef __cplusplus } #endif ================================================ FILE: src/neuron-gatherinfo/LICENSE ================================================ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: src/neuron-gatherinfo/clear_params_tfpb.py ================================================ import re import copy import argparse import tensorflow as tf import numpy as np import string from google.protobuf import text_format from tensorflow.core.framework import node_def_pb2 from tensorflow.core.framework import attr_value_pb2 from tensorflow.python.framework import tensor_util from tensorflow.tools.graph_transforms import TransformGraph def zero_const(node): val = tf.make_ndarray(node.attr.get("value").tensor) new_val = val * 0.0 new_tensor = tensor_util.make_tensor_proto(new_val, new_val.dtype, new_val.shape) node.attr["value"].CopyFrom(attr_value_pb2.AttrValue(tensor=new_tensor)) def ZeroAllConst(graphdef): sess = tf.compat.v1.Session(graph=tf.import_graph_def(graphdef)) const_by_name = {} node_by_name = {} for node in graphdef.node: node_by_name[node.name] = node if node.op == "Const": const_by_name[node.name] = node if node.op == "BiasAdd" or node.op == "MatMul" \ or node.op.startswith("Conv") \ or node.op.startswith("FusedBatchNorm"): for i in node.input: i_node = node_by_name[i] if i_node.op == "Const": zero_const(i_node) if i_node.op == "Identity": x_node = node_by_name[i_node.input[0]] if x_node.op == "Const": zero_const(x_node) return graphdef def load_graph(model_file): graph_def = tf.compat.v1.GraphDef() with open(model_file, "rb") as f: graph_def.ParseFromString(f.read()) return graph_def if __name__ == "__main__": parser = argparse.ArgumentParser(description="Zero-out parameters of BiasAdd, MatMul, Conv*, and FusedBatchNorm of TensorFlow frozen graph.") parser.add_argument("--graph", help="File name of frozen graph to be converted", required=True) parser.add_argument("--out_graph", help="File name to save converted frozen graph", required=True) args = 
parser.parse_args() graph_orig = load_graph(args.graph) graph_mod = ZeroAllConst(graph_orig) with tf.io.gfile.GFile(args.out_graph, "wb") as f: f.write(graph_mod.SerializeToString()) #with tf.io.gfile.GFile(args.out_graph + "txt", 'w') as f: # f.write(text_format.MessageToString(graph_mod)) ================================================ FILE: src/neuron-gatherinfo/mx_neuron_check_model.py ================================================ import os import json import sys import struct import argparse import subprocess from collections import Counter class neuron_parser: def __init__(self): self.parser = argparse.ArgumentParser() self.parser.add_argument('model_path', type=str, help='path prefix to MXNet model (the part before -symbol.json).') self.parser.add_argument('--show_names', action='store_true', help='list operation by name instead of summarizing by type (caution: this option will generate many lines of output for a large model).') self.parser.add_argument('--expand_subgraph', action='store_true', help='show subgraph operations.') self.parser_args = self.parser.parse_args() self.neuronop_info = {} self.total_pipeline_cores = 0 self.min_required_pipeline_cores = 0 path = self.parser_args.model_path if os.path.exists(path + '-symbol.json'): self.load_mxnet_model(path) elif os.path.isdir(path): self.load_tensorflow_model(path) else: raise RuntimeError('Cannot determine framework type from model path argument.') self.supported = self.get_neuron_supported() self.supported.extend(self.addl_support) for name, executable, (sg_nodetypes, sg_nodenames) in self.neuron_nodes: num_cores, requested_cores, _ = self.get_cores_from_executable(executable) self.neuronop_info[name] = (num_cores, requested_cores, sg_nodetypes, sg_nodenames) self.total_pipeline_cores += num_cores if num_cores > self.min_required_pipeline_cores: self.min_required_pipeline_cores = num_cores def get_neuron_supported(self): exec_cmd = ["neuron-cc", "list-operators", "--framework", self.framework] oplist = subprocess.check_output(' '.join(exec_cmd), shell=True) oplist = str(oplist, 'utf-8') oplist = oplist.split("\n") return oplist[:-1] # Remove the last element which is '' def get_tf_subgraph_types_names(self, node): from tensorflow.core.framework import graph_pb2 graph_def = graph_pb2.GraphDef() graph_def.ParseFromString(node.attr['graph_def'].s) sg_nodes = graph_def.node sg_nodes = [sg_node for sg_node in sg_nodes if sg_node.op not in self.excl_types] nodetypes = [sg_node.op for sg_node in sg_nodes] nodenames = [sg_node.name for sg_node in sg_nodes] return nodetypes, nodenames def load_tensorflow_model(self, path): import tensorflow as tf import tensorflow_hub as hub self.framework = 'TENSORFLOW' self.neuron_optype = "NeuronOp" self.excl_types = ['Placeholder', 'PlaceholderWithDefault', 'NoOp', 'Const', 'Identity', 'IdentityN', 'VarHandleOp', 'VarIsInitializedOp', 'AssignVariableOp', 'ReadVariableOp', 'StringJoin', 'ShardedFilename', 'SaveV2', 'MergeV2Checkpoints', 'RestoreV2'] self.addl_support = ['FusedBatchNormV3', 'BatchMatMulV2', 'AddV2', 'StopGradient', self.neuron_optype] model = hub.load(path) graph_def = model.graph.as_graph_def() nodes = graph_def.node nodes = [node for node in nodes if node.op not in self.excl_types] self.nodetypes = [node.op for node in nodes] self.nodenames = [node.name for node in nodes] self.neuron_nodes = [(node.name, node.attr['executable'].s, self.get_tf_subgraph_types_names(node)) for node in nodes if node.op == self.neuron_optype] def get_mx_subgraph_types_names(self, node): nodetypes 
= [] nodenames = [] for sg in node['subgraphs']: filtered_nodes = [sg_node for sg_node in sg['nodes'] if sg_node['op'] not in self.excl_types] nodetypes.extend([sg_node['op'] for sg_node in filtered_nodes]) nodenames.extend([sg_node['name'] for sg_node in filtered_nodes]) return nodetypes, nodenames def load_mxnet_model(self, path): import mxnet as mx if mx.__version__ != "1.5.1": try: import mx_neuron as mxn except ImportError: raise RuntimeError("Please install mx_neuron package.") self.framework = 'MXNET' self.neuron_optype = "_neuron_subgraph_op" self.excl_types = ['null'] self.addl_support = [self.neuron_optype] sym, args, auxs = mx.model.load_checkpoint(path, 0) nodes = json.loads(sym.tojson())["nodes"] nodes = [node for node in nodes if node['op'] not in self.excl_types] self.nodetypes = [node['op'] for node in nodes] self.nodenames = [node['name'] for node in nodes] neuron_nodes_tmp = [node for node in nodes if node['op'] == self.neuron_optype] self.neuron_nodes = [(node['name'], bytearray(args[node['name']+"_neuronbin"].asnumpy()), self.get_mx_subgraph_types_names(node)) for node in neuron_nodes_tmp] @staticmethod def get_cores_from_executable(executable): _NC_HEADER_SIZE = 544 header = executable[:_NC_HEADER_SIZE] info = list(struct.unpack('168xI304xI64B', header)) numCores = info.pop(0) numCoresRequested = info.pop(0) coresPerNode = info return numCores, numCoresRequested, coresPerNode # Display table of operation type or name and whether supported or not def print_node_type_info(self): self.cnt_total = len(self.nodetypes) self.cnt_supported = 0 if self.parser_args.show_names: widthn = max(max(map(len, self.nodenames)), 8) widtht = max(max(map(len, self.nodetypes)), 8) format_str = "{:<" + str(widthn) + "} {:<" + str(widtht) + "} {:<4}" pp = lambda x: print(format_str.format(*x)) pp(['Op Name', 'Op Type', 'Neuron Supported ?']) pp(['-------', '-------', '------------------']) for idx, opname in enumerate(self.nodenames): optype = self.nodetypes[idx] if optype in self.supported: pp([opname, optype, 'Yes']) self.cnt_supported += 1 for idx, opname in enumerate(self.nodenames): optype = self.nodetypes[idx] if optype not in self.supported: pp([opname, optype, 'No']) else: count = Counter(self.nodetypes) width = max(max(map(len, self.nodetypes)), 8) format_str = "{:<" + str(width) + "} {:<14} {:<4}" pp = lambda x: print(format_str.format(*x)) pp(['Op Type', 'Num Instances', 'Neuron Supported ?']) pp(['-------', '-------------', '------------------']) for key in count: if key in self.supported: pp([key, count[key], 'Yes']) self.cnt_supported += count[key] for key in count: if key not in self.supported: pp([key, count[key], 'No']) print() def print_subgraph_ops(self, sg_nodetypes, sg_nodenames): if self.parser_args.show_names: widthn = max(max(map(len, sg_nodenames)), 8) widtht = max(max(map(len, sg_nodetypes)), 8) format_str = "{:<" + str(widthn) + "} {:<" + str(widtht) + "}" pp = lambda x: print(' ', format_str.format(*x)) pp(['Op Name', 'Op Type']) pp(['-------', '-------']) for idx, opname in enumerate(sg_nodenames): optype = sg_nodetypes[idx] pp([opname, optype]) else: count = Counter(sg_nodetypes) width = max(max(map(len, sg_nodetypes)), 8) format_str = "{:<" + str(width) + "} {:<14}" pp = lambda x: print(' ', format_str.format(*x)) pp(['Op Type', 'Num Instances']) pp(['-------', '-------------']) for key in count: pp([key, count[key]]) def print_neuron_node_info(self): idx = 0 width = max(max(map(len, self.neuronop_info)), 14) format_str = "{:<" + str(width) + "} {:<14}" pp = lambda x: 
print(format_str.format(*x)) pp(['Subgraph Name', 'Num Pipelined NeuronCores']) pp(['-------------', '-------------------------']) core_cnt_list = [] for name, (num_cores, _, sg_nodetypes, sg_nodenames) in self.neuronop_info.items(): pp([name, num_cores]) core_cnt_list.append(num_cores) idx += 1 if self.parser_args.expand_subgraph: self.print_subgraph_ops(sg_nodetypes, sg_nodenames) print() def print_neuron_support_stats(self): print("* Total inference operations: {}".format(self.cnt_total)) print("* Total Neuron supported inference operations: {}".format(self.cnt_supported)) if self.cnt_total > 0: perc = self.cnt_supported / self.cnt_total * 100 else: perc = 0 print("* Percent of total inference operations supported by Neuron: {:.1f}".format(perc)) print() def print_common_desc(self): if self.parser_args.show_names: print("* Each line shows an operation name and whether the type of that operation is supported in Neuron.") else: print("* Each line shows an operation type, the number of instances of that type within the model,\n" \ "* and whether the type is supported in Neuron.") print("* Some operation types are excluded from the table because they are no-operations or training-related operations:\n", \ self.excl_types, "\n") def run(self): if len(self.neuronop_info) > 0: print("\n* Found {} Neuron subgraph(s) ({}(s)) in this compiled model.\n" \ "* Use this tool on the original uncompiled model to see Neuron supported operations.\n" \ "* The following table shows all operations, including Neuron subgraphs.".format(len(self.neuronop_info), self.neuron_optype)) self.print_common_desc() self.print_node_type_info() print('* Please run this model on an Inf1 instance with at least {} NeuronCore(s).'.format(self.min_required_pipeline_cores)) print("* The following list shows each Neuron subgraph with the number of pipelined NeuronCores used by the subgraph\n"\ "* (and subgraph operations if --expand_subgraph is used):\n") self.print_neuron_node_info() else: print("\n* The following table shows the supported and unsupported operations within this uncompiled model.") self.print_common_desc() self.print_node_type_info() self.print_neuron_support_stats() if __name__=='__main__': toolkit = neuron_parser() toolkit.run() ================================================ FILE: src/neuron-gatherinfo/neuron-gatherinfo.py ================================================ #!/usr/bin/env python3 # coding=utf-8 """ Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0 Program to gather information from a system """ import sys import os import argparse import shutil import subprocess import re ACTUAL_CMD = os.path.realpath(sys.argv[0]) USAGE_MSG = """ Usage: {} [options] This program is used to gather information from this system for analysis and debugging """.format(ACTUAL_CMD) EXCLUDE_FILES_BY_NAME = "weight files, model, NEFF (Neuron Executable File Format)" HELP_CC_FILES = """ Location of the neuron-cc generated files """ DEFAULT_CCFILES_LOCATION = "~/bin" SYSLOG_SEARCH_PATTERNS = r"nrtd|neuron|kernel:" EXTERNAL_CMDS = ["lscpu", "lshw", "lspci | grep -i Amazon", "neuron-cc --version", "neuron-ls", "top -b -n 1", "uname -a", "uptime", ] PROC_FILES = ["/proc/cmdline", "/proc/cpuinfo", "/proc/filesystems", "/proc/interrupts", "/proc/iomem", "/proc/loadavg", "/proc/meminfo", "/proc/modules", "/proc/mtrr", "/proc/version", ] HELP_ADDITIONAL_FILE_OR_DIR = """ Additional file or directory that the user wants to provide in the archive. 
The user can sanitize this file or directory before sharing """ INCLUDE_MSG = """ By default, only the lines containing (grep) patterns like '{}' from the syslog are copied. Other lines are excluded. Using this option allows the timestamp section of other lines to be included. The rest of the contents of the line itself are elided. Providing the timestamp section may provide time continuity while viewing the copied syslog file """.format(SYSLOG_SEARCH_PATTERNS) HELP_RT_FILES = """ Location of the neuron runtime generated files """ MISCINFO_FILE = 'miscinfo.txt' HELP_VERBOSE = """ Verbose mode displays commands executed and any additional information which may be useful in debugging the tool itself """ INCLUDE_EXTNS = ('.pb',) HELP_INCLUDE_EXTN_FILES = """ Include files with these extensions from the compiler work directory in the archive: {} """.format(INCLUDE_EXTNS) HELP_STDOUT = """ The file where the stdout of the compiler run was saved """ HELP_OUTDIR_MSG = """ The output directory where all the files and other information will be stored. The output will be stored as an archive as well as the actual directory where all the contents are copied. This will allow a simple audit of the files, if necessary. *** N O T E ***: Make sure that this directory has enough space to hold the files and resulting archive """ USERCMDFILE = "how-the-user-executed-the-script-{}.txt".format(os.path.basename(ACTUAL_CMD)) NEURONDUMPPROGRAM = "/opt/aws/neuron/bin/neuron-dump.py" NEURONDUMPFILE = os.path.splitext(os.path.basename(NEURONDUMPPROGRAM))[0] NEURON_ERRMSG = "Error: File {} doesn't exist, aws-neuron-tool package isn't installed?".format( NEURONDUMPPROGRAM) NEURON_INFO_TARBALL = "{}".format(os.path.splitext(os.path.basename(ACTUAL_CMD))[0]) NEURONTMPDIR = NEURON_INFO_TARBALL ARCHIVE_MSG = "\n\n\t******\n\tArchive created at:\n\t\t{}\n\tFrom directory:\n\t\t{}\n\t******\n\n" NOT_IMPLEMENTED_MSG = ", nothing to see here, folks (not implemented as yet)" # these are the only compiler-generated files that are included by default COMPILER_FILES = ['graph_def.neuron-cc.log', 'all_metrics.csv', 'hh-tr-operand-tensortensor.json'] COMPILER_FILES_USER_OPT_IN = ['exp_and_others.json', 'graph_def.neff', 'graph_def.pb', 'hh-spilled.json', 'hh-tr-accDN2virtDN.json', 'hh-tr-external-move.json', 'hh-tr-internal-move.json', 'hh-tr-removeDN.json', 'hh-transforms.json', 'wavegraph.json', 'hh.json', 'pass03_scheduling.json', 'relay_graph_opt_pre_color.txt', 'relay_graph_post_opt_kelp.txt', 'relay_graph_post_opt_unit_level.txt', 'relay_graph_pre_opt.txt', 'saved_model.pb', 'sch.json', 'sch_tmp.json', 'schedule_trace.json', 'wavegraph-bin.json'] MODEL_DATA_MSG = """ By using this option, the entire compiler work directory's contents will be included (excluding the {} files, unless an additional option is used). This would include model information, etc. 
The files that are included, by default, are these: {} """.format(INCLUDE_EXTNS, ", ".join(COMPILER_FILES)) MODEL_DATA_MSG_INFO = """ \t************************** \tBased on your command line option, we're also packaging these files: \t\t{} \tAnd this directory: {} \t************************** """ def get_os_version(): ''' function to obtain the Linux version Args: Output: Returns: string with value 'Ubuntu' or 'RedHat' ''' try: with open("/proc/version") as fdin: data = fdin.read() if data.find('Ubuntu') == -1: osver = 'RedHat' else: osver = 'Ubuntu' except FileNotFoundError: osver = 'Ubuntu' return osver def get_files(*, basedir, matchfiles, verbose): ''' function to get the files based on a base directory and file extension Args: basedir : base directory where files reside matchfiles : set of files to match verbose : flag to indicate if verbose messages need to be displayed Output: Returns: list of files found ''' myfiles = list() for dpath, _, files in os.walk(basedir): for mfile in files: if mfile in matchfiles: mfile = os.path.realpath(os.path.join(dpath, mfile)) if os.path.isfile(mfile): myfiles.append(mfile) else: if verbose: print("Warning: {} is not a file".format(mfile)) return myfiles def dump_compiler_info(*, outdir, location, allowmodel=False, addfldir=None, verbose=False): ''' function to gather the following information: Framework: - TensorFlow - MXNet - PyTorch Compiler: Args: outdir : output directory location : location of compiler-generated files allowmodel : if True, allow gathering of additional files verbose : flag to indicate if verbose messages need to be displayed Output: compiler-generated files copied to outdir Returns: ''' if location is not None: if allowmodel: # copy the entire directory try: shutil.copytree(location, os.path.join(outdir, os.path.basename(location)), ignore_dangling_symlinks=True) except shutil.Error: pass else: fileset = set(COMPILER_FILES) l1data = get_files(basedir=location, matchfiles=fileset, verbose=verbose) copy_files(outdir=outdir, basedir=location, filelist=l1data, verbose=verbose) if addfldir is not None: if os.path.isfile(addfldir): shutil.copy(addfldir, outdir) else: # directory copy try: shutil.copytree(addfldir, os.path.join(outdir, os.path.basename(addfldir)), ignore_dangling_symlinks=True) except shutil.Error: pass # print("Function: ", sys._getframe().f_code.co_name, # pylint: disable=W0212 # NOT_IMPLEMENTED_MSG) def copy_stdout(*, outdir, stdout, verbose): ''' function to copy the stdout file to the destination location Args: outdir : destination location (output directory) stdout : file containing the output of running neuron-cc verbose : flag to indicate if verbose messages need to be displayed Output: Returns: ''' if verbose: print("Copying {} to {}".format(stdout, outdir)) shutil.copy(stdout, outdir) def copy_syslog(*, outdir, include_flag=False, verbose): ''' function to copy contents of the syslog to the output directory Args: outdir : output directory location where the syslog's contents are to be copied include_flag : if True, include lines that do not match verbose : flag to indicate if verbose messages need to be displayed Output: copy of syslog's contents with just "Neuron-specific" lines Returns: ''' # syslog looks like this: # 2019-11-21T19:32:50.347183+00:00 ink neuron-rtd[17977]: nrtd[17977]: # The first regex (regex1) is used to match lines that we want to see in our copy regex1 = re.compile(r'^(\S+)\s.*?({})'.format(SYSLOG_SEARCH_PATTERNS)) regex2 = re.compile(r'^(\S+)\s') osver = get_os_version() if osver 
== 'Ubuntu': syslog = '/var/log/syslog' else: syslog = '/var/log/messages' try: with open(syslog) as fdin,\ open(os.path.join(outdir, 'copy-of-syslog'), 'w') as fdout: for line in fdin: match = regex1.search(line) if match is not None: fdout.write(line) else: if include_flag: match = regex2.match(line) if match is not None: # exclude the rest of the line fdout.write(match.group(1) + ' XXX contents elided XXX\n') else: print("Error in parsing this line: {}".format(line)) except FileNotFoundError: print("Error, {} not found".format(syslog)) def dump_rt_info(*, location, verbose): ''' function to dump the following information: - runtime - Framework (??) Args: location: location of runtime files verbose : flag to indicate if verbose messages need to be displayed Returns: list of info ''' # l1data = get_files(basedir=location, file_extn=('.sh')) print("Function: ", sys._getframe().f_code.co_name, # pylint: disable=W0212 NOT_IMPLEMENTED_MSG) def allow_capture_of_files(): ''' function to allow the capture of files from the customer's environment This is OFF by default and has to be explicitly enabled by the command-line option by the user Args: Output: Returns: ''' print("Function: ", sys._getframe().f_code.co_name, # pylint: disable=W0212 NOT_IMPLEMENTED_MSG) def add_additional_filters(filterfile): ''' function to apply additional filters to files that are being captured Args: filterfile : text file with patterns (regexs), one per line, to use as filters Output: Returns: ''' print("Function: ", sys._getframe().f_code.co_name, # pylint: disable=W0212 NOT_IMPLEMENTED_MSG) def dump_miscinfo(*, outdir, verbose): ''' function to dump miscellaneous information, including: - system info (uname -a) - package info (??? list of packages installed) - neuron-ls - neuron-top Args: outdir : output directory verbose : flag to indicate if verbose messages need to be displayed Output: Creates various reports in the outdir location Returns: ''' osver = get_os_version() if osver == 'Ubuntu': pkgcmds = ["apt list | egrep '^aws'", "pip list | egrep '^neuron|^numpy|^tensor|^scipy'"] else: pkgcmds = ["rpm -qa | egrep '^aws|^neuron|^numpy|^tensor|^scipy'"] cmds = EXTERNAL_CMDS + pkgcmds for cmd in cmds: cmdname = cmd.split(' ')[0] # get just the command name for creating the file cmdfile = os.path.join(outdir, "report-{}.txt".format(cmdname)) with open(cmdfile, "w") as fdout: if verbose: print("Running cmd: {} and capturing output in file: {}".format(cmd, cmdfile)) try: res = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True, shell=True) stdout, stderr = res.communicate() if stderr is not None: fdout.write("Error in executing cmd: {}\nError: {}\n".format(cmd, str(stderr))) else: fdout.write("Output from executing cmd: {}\n\n{}\n".format(cmd, str(stdout))) except (OSError, ValueError) as err: fdout.write("Error in executing cmd: {}\nError: {}\n".format(cmd, err)) def dump_proc_info(*, outdir, verbose): ''' function to dump information related to "/proc" Args: outdir : output directory verbose : flag to indicate if verbose messages need to be displayed Output: Creates various reports in the outdir location Returns: ''' for procfile in PROC_FILES: fname = procfile.split('/') # use the 2nd and 3rd items from this (canonical form) pfile = os.path.join(outdir, "report-{}-{}.txt".format(fname[1], fname[2])) if verbose: print("Copying contents of: {} to: {}".format(procfile, pfile)) try: with open(pfile, "w") as fdout, open(procfile) as fdin: fdout.write("Contents of 
{}\n\n".format(procfile)) fdout.write(fdin.read()) except FileNotFoundError: print("Error: file {} not found\n".format(procfile)) def sanity_check(options): ''' function to check if command-line arguments are valid Args: options : options from argparse parser Output: Returns: 0 : success 1 : failure ''' # the script has to be run as root or "sudo" if os.getuid() != 0: print("*** Rerun this script as user 'root' or as sudo **\n\n") return 1 outdir = options.outdir retval = 0 if os.path.isfile(outdir) or os.path.isdir(outdir): print("Error: {} already exists, please provide a non-existing directory".format(outdir)) retval = 1 if not os.path.isfile(options.stdout): print("Error: {} doesn't exist, please provide an existing file".format(options.stdout)) retval = 1 if options.addfldir is not None: if not os.path.isfile(options.addfldir) and not os.path.isdir(options.addfldir): print("Error: {} isn't a file nor a directory".format(options.addfldir)) retval = 1 for mydir in [options.ccdir, options.rtdir]: if mydir is not None and not os.path.isdir(mydir): print("Error: {} is not a directory, please provide a directory".format(mydir)) retval = 1 if options.allowmodel and options.ccdir is None: print("Error: you need to specify a compiler work directory along with the 'm' option") retval = 1 return retval def copy_files(*, outdir, basedir, filelist, verbose): ''' function to copy files from the original source area into the destination. This is also the place for any massaging or eliding of file contents Args: outdir : destination location basedir : base directory from where the files are to be copied filelist: list of files to be copied verbose : flag to indicate if verbose messages need to be displayed Output: Copy of files (possibly altered) from the source Returns: ''' for thisfile in filelist: myfile = '.' 
+ thisfile[len(basedir):] mydir = os.path.dirname(os.path.join(outdir, myfile)) if not os.path.isdir(mydir): os.makedirs(mydir) shutil.copy(thisfile, mydir, follow_symlinks=True) def write_miscinfo(*, outdir, data): ''' function to write out the contents of the miscellaneous commands Args: outdir : destination location data : list of strings to be stored in a file Output: MISCINFO_FILE created with the contents of the output of the various commands ''' flname = os.path.join(outdir, MISCINFO_FILE) with open(flname, "w") as fdout: fdout.write("\n".join(data)) def run_neuron_dump(outdir, verbose): ''' function to call the existing neuron-dump.py tool Args: outdir : destination location verbose : flag to indicate if verbose messages need to be displayed Output: tarball created by this tool Returns: ''' if not os.path.isfile(NEURONDUMPPROGRAM): print(NEURON_ERRMSG) return cmd = "{} -o {}".format(NEURONDUMPPROGRAM, os.path.join(outdir, NEURONDUMPFILE)) if verbose: print("Executing command: {}".format(cmd)) stdout = None # ensure stdout is defined even if Popen raises below try: res = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True, shell=True) stdout, stderr = res.communicate() if stderr is not None: print("Error in executing cmd: {}\nError: {}\n".format(cmd, str(stderr))) except (OSError, ValueError) as err: print("Error in executing cmd: {}\nError: {}\n".format(cmd, err)) if verbose: print("Output of cmd: {}\n{}".format(cmd, stdout)) def package_tarball(*, outdir, allowmodel, ccdir, verbose): ''' function to package everything into a tarball Args: outdir : output directory allowmodel : flag to indicate whether the user has allowed gathering of model data ccdir : compiler work directory (used in the informational message) verbose : flag to indicate if verbose messages need to be displayed Output: A tarball created in the directory one level above outdir (the directory provided by the user) Returns: ''' mytarball = os.path.join(os.path.split(outdir)[0], NEURON_INFO_TARBALL) if verbose: print("Creating archive: {}".format(mytarball)) archivefile = shutil.make_archive(mytarball, 'gztar', outdir) print(ARCHIVE_MSG.format(archivefile, outdir)) if allowmodel: print(MODEL_DATA_MSG_INFO.format("\n\t\t".join(COMPILER_FILES), ccdir)) def add_cmdline_args(): ''' function to add the command line arguments and options Args: Output: Returns: parser for cmd line ''' parser = argparse.ArgumentParser( formatter_class=argparse.RawDescriptionHelpFormatter, description=USAGE_MSG) parser.add_argument('--additionalfileordir', dest='addfldir', help=HELP_ADDITIONAL_FILE_OR_DIR, default=None) parser.add_argument('-c', '--compileroutdir', dest='ccdir', help=HELP_CC_FILES, default=None) parser.add_argument('-i', '--include', dest='includemismatch', help=INCLUDE_MSG, action='store_true', default=False) parser.add_argument('-f', '--filter', dest='filterfile', default=None) parser.add_argument('-m', "--modeldata", # data related to model, etc. 
will be gathered dest='allowmodel', action='store_true', help=MODEL_DATA_MSG, default=False) parser.add_argument('-o', '--out', dest='outdir', help=HELP_OUTDIR_MSG, required=True) parser.add_argument('-r', '--runtimeoutdir', dest='rtdir', help=HELP_RT_FILES, default=None) parser.add_argument('-s', '--stdout', dest='stdout', help=HELP_STDOUT, required=True) parser.add_argument('-v', '--verbose', dest='verbose', help=HELP_VERBOSE, action='store_true', default=False) return parser def main(): """ main function creates command-line option parser, sanity checks, and then executes code based on command-line options """ parser = add_cmdline_args() if len(sys.argv) == 1: parser.print_help() sys.exit(1) options = parser.parse_args() # append the directory where we'll create files to what the user provides options.outdir = os.path.realpath(os.path.join(options.outdir, NEURONTMPDIR)) if options.ccdir is not None: options.ccdir = os.path.realpath(options.ccdir) if options.addfldir is not None: options.addfldir = os.path.realpath(options.addfldir) if options.rtdir is not None: options.rtdir = os.path.realpath(options.rtdir) options.stdout = os.path.realpath(options.stdout) if sanity_check(options): parser.print_help() sys.exit(1) # create the base directory try: os.makedirs(options.outdir) except FileNotFoundError: print("Error in creating directory {}".format(options.outdir)) sys.exit(1) # if options.allow: # allow_capture_of_files() if options.filterfile is not None: add_additional_filters(os.path.realpath(options.filterfile)) # record the command as executed by the user with open(os.path.join(options.outdir, USERCMDFILE), "w") as fdout: fdout.write("Command executed as: {}\n".format(" ".join(sys.argv))) dump_compiler_info(outdir=options.outdir, location=options.ccdir, allowmodel=options.allowmodel, addfldir=options.addfldir, verbose=options.verbose) # Not being used now. 
neuron-dump.py would do this # dump_rt_info(location=options.rtdir, verbose=options.verbose) dump_miscinfo(outdir=options.outdir, verbose=options.verbose) dump_proc_info(outdir=options.outdir, verbose=options.verbose) copy_stdout(outdir=options.outdir, stdout=options.stdout, verbose=options.verbose) copy_syslog(outdir=options.outdir, include_flag=options.includemismatch, verbose=options.verbose) # run the existing tool neuron-dump.py as well run_neuron_dump(outdir=options.outdir, verbose=options.verbose) package_tarball(outdir=options.outdir, allowmodel=options.allowmodel, ccdir=options.ccdir, verbose=options.verbose) # change permissions for the directory and output os.system("chown -R {} {}".format(os.getlogin(), os.path.split(options.outdir)[0])) # write_miscinfo(outdir=options.outdir, data=l3) if __name__ == "__main__": main() ================================================ FILE: src/neuron-gatherinfo/tf_neuron_check_model.py ================================================ import os import json import sys import struct import argparse import subprocess from collections import Counter class neuron_parser: def __init__(self): self.parser = argparse.ArgumentParser() self.parser.add_argument('model_path', type=str, help='a TensorFlow SavedModel directory (currently supporting TensorFlow v1 SavedModel only).') self.parser.add_argument('--show_names', action='store_true', help='list operation by name instead of summarizing by type (caution: this option will generate many lines of output for a large model).') self.parser.add_argument('--expand_subgraph', action='store_true', help='show subgraph operations.') self.parser_args = self.parser.parse_args() self.neuronop_info = {} self.total_pipeline_cores = 0 self.min_required_pipeline_cores = 0 path = self.parser_args.model_path if os.path.exists(path + '-symbol.json'): self.load_mxnet_model(path) elif os.path.isdir(path): self.load_tensorflow_model(path) else: raise RuntimeError('Cannot determine framework type from model path argument.') self.supported = self.get_neuron_supported() self.supported.extend(self.addl_support) for name, executable, (sg_nodetypes, sg_nodenames) in self.neuron_nodes: num_cores, requested_cores, _ = self.get_cores_from_executable(executable) self.neuronop_info[name] = (num_cores, requested_cores, sg_nodetypes, sg_nodenames) self.total_pipeline_cores += num_cores if num_cores > self.min_required_pipeline_cores: self.min_required_pipeline_cores = num_cores def get_neuron_supported(self): exec_cmd = ["neuron-cc", "list-operators", "--framework", self.framework] oplist = subprocess.check_output(' '.join(exec_cmd), shell=True) oplist = str(oplist, 'utf-8') oplist = oplist.split("\n") return oplist[:-1] # Remove the last element which is '' def get_tf_subgraph_types_names(self, node): from tensorflow.core.framework import graph_pb2 graph_def = graph_pb2.GraphDef() graph_def.ParseFromString(node.attr['graph_def'].s) sg_nodes = graph_def.node sg_nodes = [sg_node for sg_node in sg_nodes if sg_node.op not in self.excl_types] nodetypes = [sg_node.op for sg_node in sg_nodes] nodenames = [sg_node.name for sg_node in sg_nodes] return nodetypes, nodenames def load_tensorflow_model(self, path): import tensorflow as tf import tensorflow_hub as hub self.framework = 'TENSORFLOW' self.neuron_optype = "NeuronOp" self.excl_types = ['Placeholder', 'PlaceholderWithDefault', 'NoOp', 'Const', 'Identity', 'IdentityN', 'VarHandleOp', 'VarIsInitializedOp', 'AssignVariableOp', 'ReadVariableOp', 'StringJoin', 'ShardedFilename', 'SaveV2', 
'MergeV2Checkpoints', 'RestoreV2'] self.addl_support = ['FusedBatchNormV3', 'BatchMatMulV2', 'AddV2', 'StopGradient', self.neuron_optype] model = hub.load(path) graph_def = model.graph.as_graph_def() nodes = graph_def.node nodes = [node for node in nodes if node.op not in self.excl_types] self.nodetypes = [node.op for node in nodes] self.nodenames = [node.name for node in nodes] self.neuron_nodes = [(node.name, node.attr['executable'].s, self.get_tf_subgraph_types_names(node)) for node in nodes if node.op == self.neuron_optype] def get_mx_subgraph_types_names(self, node): nodetypes = [] nodenames = [] for sg in node['subgraphs']: filtered_nodes = [sg_node for sg_node in sg['nodes'] if sg_node['op'] not in self.excl_types] nodetypes.extend([sg_node['op'] for sg_node in filtered_nodes]) nodenames.extend([sg_node['name'] for sg_node in filtered_nodes]) return nodetypes, nodenames def load_mxnet_model(self, path): import mxnet as mx if mx.__version__ != "1.5.1": try: import mxnetneuron as mxn except ImportError: raise RuntimeError("Please install mxnetneuron package.") self.framework = 'MXNET' self.neuron_optype = "_neuron_subgraph_op" self.excl_types = ['null'] self.addl_support = [self.neuron_optype] sym, args, auxs = mx.model.load_checkpoint(path, 0) nodes = json.loads(sym.tojson())["nodes"] nodes = [node for node in nodes if node['op'] not in self.excl_types] self.nodetypes = [node['op'] for node in nodes] self.nodenames = [node['name'] for node in nodes] neuron_nodes_tmp = [node for node in nodes if node['op'] == self.neuron_optype] self.neuron_nodes = [(node['name'], bytearray(args[node['name']+"_neuronbin"].asnumpy()), self.get_mx_subgraph_types_names(node)) for node in neuron_nodes_tmp] @staticmethod def get_cores_from_executable(executable): _NC_HEADER_SIZE = 544 header = executable[:_NC_HEADER_SIZE] info = list(struct.unpack('168xI304xI64B', header)) numCores = info.pop(0) numCoresRequested = info.pop(0) coresPerNode = info return numCores, numCoresRequested, coresPerNode # Display table of operation type or name and whether supported or not def print_node_type_info(self): self.cnt_total = len(self.nodetypes) self.cnt_supported = 0 if self.parser_args.show_names: widthn = max(max(map(len, self.nodenames)), 8) widtht = max(max(map(len, self.nodetypes)), 8) format_str = "{:<" + str(widthn) + "} {:<" + str(widtht) + "} {:<4}" pp = lambda x: print(format_str.format(*x)) pp(['Op Name', 'Op Type', 'Neuron Supported ?']) pp(['-------', '-------', '------------------']) for idx, opname in enumerate(self.nodenames): optype = self.nodetypes[idx] if optype in self.supported: pp([opname, optype, 'Yes']) self.cnt_supported += 1 for idx, opname in enumerate(self.nodenames): optype = self.nodetypes[idx] if optype not in self.supported: pp([opname, optype, 'No']) else: count = Counter(self.nodetypes) width = max(max(map(len, self.nodetypes)), 8) format_str = "{:<" + str(width) + "} {:<14} {:<4}" pp = lambda x: print(format_str.format(*x)) pp(['Op Type', 'Num Instances', 'Neuron Supported ?']) pp(['-------', '-------------', '------------------']) for key in count: if key in self.supported: pp([key, count[key], 'Yes']) self.cnt_supported += count[key] for key in count: if key not in self.supported: pp([key, count[key], 'No']) print() def print_subgraph_ops(self, sg_nodetypes, sg_nodenames): if self.parser_args.show_names: widthn = max(max(map(len, sg_nodenames)), 8) widtht = max(max(map(len, sg_nodetypes)), 8) format_str = "{:<" + str(widthn) + "} {:<" + str(widtht) + "}" pp = lambda x: print(' ', format_str.format(*x)) 
            pp(['Op Name', 'Op Type'])
            pp(['-------', '-------'])
            for idx, opname in enumerate(sg_nodenames):
                optype = sg_nodetypes[idx]
                pp([opname, optype])
        else:
            count = Counter(sg_nodetypes)
            width = max(max(map(len, sg_nodetypes)), 8)
            format_str = "{:<" + str(width) + "} {:<14}"
            pp = lambda x: print(' ', format_str.format(*x))
            pp(['Op Type', 'Num Instances'])
            pp(['-------', '-------------'])
            for key in count:
                pp([key, count[key]])

    def print_neuron_node_info(self):
        idx = 0
        width = max(max(map(len, self.neuronop_info)), 14)
        format_str = "{:<" + str(width) + "} {:<14}"
        pp = lambda x: print(format_str.format(*x))
        pp(['Subgraph Name', 'Num Pipelined NeuronCores'])
        pp(['-------------', '-------------------------'])
        core_cnt_list = []
        for name, (num_cores, _, sg_nodetypes, sg_nodenames) in self.neuronop_info.items():
            pp([name, num_cores])
            core_cnt_list.append(num_cores)
            idx += 1
            if self.parser_args.expand_subgraph:
                self.print_subgraph_ops(sg_nodetypes, sg_nodenames)
        print()

    def print_neuron_support_stats(self):
        print("* Total inference operations: {}".format(self.cnt_total))
        print("* Total Neuron supported inference operations: {}".format(self.cnt_supported))
        if self.cnt_total > 0:
            perc = self.cnt_supported / self.cnt_total * 100
        else:
            perc = 0
        print("* Percent of total inference operations supported by Neuron: {:.1f}".format(perc))
        print()

    def print_common_desc(self):
        if self.parser_args.show_names:
            print("* Each line shows an operation name and whether the type of that operation is supported in Neuron.")
        else:
            print("* Each line shows an operation type, the number of instances of that type within the model,\n"
                  "* and whether the type is supported in Neuron.")
        print("* Some operation types are excluded from the table because they are no-operations or training-related operations:\n",
              self.excl_types, "\n")

    def run(self):
        if len(self.neuronop_info) > 0:
            print("\n* Found {} Neuron subgraph(s) ({}(s)) in this compiled model.\n"
                  "* Use this tool on the original uncompiled model to see Neuron supported operations.\n"
                  "* The following table shows all operations, including Neuron subgraphs.".format(len(self.neuronop_info), self.neuron_optype))
            self.print_common_desc()
            self.print_node_type_info()
            print('* Please run this model on an Inf1 instance with at least {} NeuronCore(s).'.format(self.min_required_pipeline_cores))
            print("* The following list shows each Neuron subgraph with the number of pipelined NeuronCores used by the subgraph\n"
                  "* (and subgraph operations if --expand_subgraph is used):\n")
            self.print_neuron_node_info()
        else:
            print("\n* The following table shows the supported and unsupported operations within this uncompiled model.")
            self.print_common_desc()
            self.print_node_type_info()
            self.print_neuron_support_stats()


if __name__ == '__main__':
    toolkit = neuron_parser()
    toolkit.run()


================================================
FILE: src/neuronperf/LICENSE
================================================
AWS Neuron License Agreement

THIS IS AN AGREEMENT BETWEEN YOU AND AMAZON WEB SERVICES, INC. (WITH ITS AFFILIATES, "AWS" OR "WE") THAT GOVERNS YOUR USE OF THE AWS NEURON SOFTWARE (TOGETHER WITH ANY UPDATES AND UPGRADES TO IT, AND ACCOMPANYING DOCUMENTATION, THE “SOFTWARE”) THAT WE MAKE AVAILABLE TO YOU. IF YOU DOWNLOAD, INSTALL, OR USE THE SOFTWARE, YOU ACCEPT AND AGREE TO BE BOUND BY THIS AGREEMENT AND REPRESENT THAT YOU HAVE THE AUTHORITY TO BIND YOURSELF OR THE ENTITY YOU REPRESENT TO THIS AGREEMENT.

1.
Use of the Software We hereby grant you a personal, limited, nonexclusive, non-transferable, non-sublicenseable, revocable, royalty-free, worldwide license during the term of this Agreement to install and use the Software in connection with AWS Services. You may not use the Software if you do not have an account in good standing with AWS. Some components of the Software (whether developed by AWS or third parties) may also be governed by applicable open source software licenses located in the software component's source code. Your license rights with respect to these individual components are defined by the applicable open source software license, and nothing in this Agreement will restrict, limit, or otherwise affect any rights or obligations you may have, or conditions to which you may be subject, under such open source software licenses. “AWS Services” means each of the services made available by AWS as may be updated by AWS from time to time in its sole discretion at https://aws.amazon.com/service-terms/ and are subject to your AWS Customer Agreement or AWS Enterprise Agreement. 2. Limitations You may not, and you will not encourage, assist or authorize any other person to (a) sell, rent, lease, lend, loan, distribute, act as a service bureau, publicly communicate, transform, or sub-license the Software or otherwise assign any rights to the Software in whole or in part, (b) modify, alter, tamper with, repair, or otherwise create derivative works of the Software, (c) reverse engineer, disassemble, or decompile the Software or apply any other process or procedure to derive the source code of any software included in the Software, or (d) access or use the Software or the AWS Service in a way intended to avoid incurring fees or exceeding usage limits or quotas. All rights granted to you are conditioned on your continued compliance with this Agreement, and will immediately and automatically terminate if you do not comply with any term or condition of this Agreement or the AWS Customer Agreement or AWS Enterprise Agreement, including any failure to remit timely payment for the Software or the AWS Service. You will not use the Software with any software or other materials that are subject to licenses or restrictions (e.g., open source software licenses) that, when combined with the Software, would require us to disclose, license, distribute or otherwise make all or any part of such Software available to anyone. You will not remove, modify, or obscure any copyright, patent, trademark or other proprietary or attribution notices on or in any Software. 3. Reservation of Rights You may not use the Software for any illegal purpose. The Software is the intellectual property of AWS or its licensors. The structure, organization, and code of the Software are valuable trade secrets and AWS confidential information. The Software is protected by applicable law, including without limitation copyright laws and international treaty provisions. Except for the rights expressly granted to you in this Agreement, all right, title and interest in the Software are reserved and retained by AWS and our licensors. You do not acquire any intellectual property or other rights in the Software as a result of downloading, installing, or using the Software. 4. Updates In order to keep the Software up-to-date, we may offer automatic or manual updates at any time. If we elect to provide maintenance or support of any kind, we may terminate that maintenance or support at any time without notice to you. 5. 
Termination You may terminate this Agreement at any time by uninstalling and destroying all copies of the Software that are in your possession or control. This Agreement (including any rights granted to you under this Agreement) will immediately and automatically terminate without notice from us if (a) you fail to comply with any term or condition of this Agreement or any other agreement you have with AWS, or (b) you fail to make timely payment for any AWS Service. In the case of termination, you must cease all downloading, installation, and use of the Software and uninstall and destroy all copies of the Software that are in your possession or control. We may modify, suspend, discontinue, or terminate your right to use part or all of the Software at any time without notice to you, and in that event we may modify the Software to make it inoperable. AWS will not be liable to you should it exercise those rights. Our failure to insist upon or enforce your strict compliance with this Agreement will not constitute a waiver of any of our rights. No waiver of any provision of this Agreement shall be effective unless in writing. 6. Disclaimer of Warranties and Limitation of Liability a. YOU EXPRESSLY ACKNOWLEDGE AND AGREE THAT INSTALLATION AND USE OF, AND ANY OTHER ACCESS TO, THE SOFTWARE IS AT YOUR SOLE RISK. THE SOFTWARE IS DELIVERED TO YOU “AS IS” WITH ALL FAULTS AND WITHOUT WARRANTY OF ANY KIND, AND AWS, ITS LICENSORS AND DISTRIBUTORS, AND EACH OF THEIR RESPECTIVE AFFILIATES AND SUPPLIERS (COLLECTIVELY, THE “RELEASED PARTIES”) DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, QUIET ENJOYMENT, AND NON-INFRINGEMENT. NO ORAL OR WRITTEN INFORMATION OR ADVICE GIVEN BY A RELEASED PARTY OR AN AUTHORIZED REPRESENTATIVE OF A RELEASED PARTY WILL CREATE A WARRANTY. THE LAWS OF CERTAIN JURISDICTIONS DO NOT ALLOW THE DISCLAIMER OF IMPLIED WARRANTIES. IF THESE LAWS APPLY TO YOU, SOME OR ALL OF THE ABOVE DISCLAIMERS, EXCLUSIONS, OR LIMITATIONS MAY NOT APPLY TO YOU, AND YOU MAY HAVE ADDITIONAL RIGHTS. b. TO THE EXTENT NOT PROHIBITED BY LAW, NO RELEASED PARTY WILL BE LIABLE TO YOU FOR ANY INCIDENTAL OR CONSEQUENTIAL DAMAGES FOR BREACH OF ANY EXPRESS OR IMPLIED WARRANTY, BREACH OF CONTRACT, NEGLIGENCE, STRICT LIABILITY, OR ANY OTHER LEGAL THEORY RELATED TO THE SOFTWARE, INCLUDING WITHOUT LIMITATION ANY DAMAGES ARISING OUT OF LOSS OF PROFITS, REVENUE, DATA, OR USE OF THE APPLICATION, EVEN IF A RELEASED PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN ANY CASE, ANY RELEASED PARTY’S AGGREGATE LIABILITY UNDER THE AGREEMENT WILL BE LIMITED TO $50.00. THE LAWS OF CERTAIN JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR CONSEQUENTIAL DAMAGES. IF THESE LAWS APPLY TO YOU, SOME OR ALL OF THE ABOVE EXCLUSIONS OR LIMITATIONS MAY NOT APPLY TO YOU, AND YOU MAY HAVE ADDITIONAL RIGHTS. 7. Indemnification You are liable for and will defend, indemnify, and hold harmless the Released Parties and their officers, directors, agents, and employees, from and against any liability, loss, damage, cost, or expense (including reasonable attorneys’ fees) arising out of your use of the Software, violation of the Agreement, violation of applicable law, or violation of any right of any person or entity, including without limitation intellectual property rights. 8. 
Compliance with Laws; Export Regulations You will comply with all export and re-export restrictions and regulations of the United States Department of Commerce and other United States and foreign agencies and authorities that may apply to the Software, and not to transfer, or encourage, assist, or authorize the transfer of the Software to a prohibited country or otherwise in violation of any applicable restrictions or regulations. 9. U.S. Government End Users The Software is provided to the U.S. Government as “commercial items,” “commercial computer software,” “commercial computer software documentation,” and “technical data” with the same rights and restrictions generally applicable to the Software. If you are using the Software on behalf of the U.S. Government and these terms fail to meet the U.S. Government’s needs or are inconsistent in any respect with federal law, you will immediately discontinue your use of the Software. The terms “commercial item,” “commercial computer software,” “commercial computer software documentation,” and “technical data” are defined in the Federal Acquisition Regulation and the Defense Federal Acquisition Regulation Supplement. 10. Amendment We may amend this Agreement at our sole discretion by posting the revised terms on the AWS website (aws.amazon.com) or within the Software. Your continued use of the Software after any amendment's effective date evidences your agreement to be bound by it. If you do not agree to a change, you must stop using the Software and terminate this Agreement. 13. Conflicts In the event of any conflict or inconsistency among the terms and conditions of this Agreement and the existing AWS Customer Agreement or your AWS Enterprise Agreement, such conflict or inconsistency will be resolved by giving precedence to this Agreement. 14. Entire Agreement and Severability This is the entire agreement between AWS and you regarding the Software and supersedes all prior understandings regarding such subject matter (including any Evaluation Agreement). If any term or condition of this Agreement is deemed invalid, void, or for any reason unenforceable, that part will be deemed severable and will not affect the validity and enforceability of any remaining term or condition. ================================================ FILE: src/neuronperf/README.md ================================================ # NeuronPerf A library for benchmarking machine learning models on accelerators. 
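
## Example

A minimal sketch of the intended workflow (the `neuronperf.torch` calls below are illustrative assumptions; see the documentation link below for authoritative usage):

```python
import torch
import neuronperf as npf
import neuronperf.torch  # framework submodule; assumes torch-neuron is installed

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
example = torch.zeros(1, 8)

# Compile for batch size 1, then benchmark the compiled artifact and print a summary.
index_filename = npf.torch.compile(model, [example], batch_sizes=[1])
reports = npf.torch.benchmark(index_filename, [example], batch_sizes=[1])
npf.print_reports(reports)
```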
## Documentation https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuronperf/index.html ================================================ FILE: src/neuronperf/build.sh ================================================ #!/bin/bash set -ex python3 -m pytest -vv \ --verbose \ --ignore=build/private \ --cov=neuronperf \ --cov-report term-missing \ --cov-report html:build/brazil-documentation/coverage \ --cov-report xml:build/brazil-documentation/coverage/coverage.xml \ --color=yes \ -x \ test \ -m "sanity or slow" python3 setup.py bdist_wheel --dist-dir build/pip/public/neuronperf ================================================ FILE: src/neuronperf/conf.py ================================================ """Sphinx configuration.""" import datetime import os import shutil from amazon_doc_utils import brazil_info # Get metadata from brazil brazil_version, intersphinx_factory = brazil_info.get( [brazil_info.PackageVersion, brazil_info.IntersphinxFactory] ) def run_apidoc(app): """Generate doc stubs using sphinx-apidoc.""" module_dir = os.path.join(app.srcdir, "../src/") output_dir = os.path.join(app.srcdir, "_apidoc") excludes = [] # Ensure that any stale apidoc files are cleaned up first. if os.path.exists(output_dir): shutil.rmtree(output_dir) cmd = [ "--separate", "--module-first", "--doc-project=API Reference", "-o", output_dir, module_dir, ] cmd.extend(excludes) try: from sphinx.ext import apidoc # Sphinx >= 1.7 apidoc.main(cmd) except ImportError: from sphinx import apidoc # Sphinx < 1.7 cmd.insert(0, apidoc.__file__) apidoc.main(cmd) def setup(app): """Register our sphinx-apidoc hook.""" app.connect("builder-inited", run_apidoc) # Sphinx configuration below. project = brazil_version.name version = brazil_version.mv release = brazil_version.full_version copyright = "{}, Amazon.com".format(datetime.datetime.now().year) intersphinx_mapping = intersphinx_factory.get_mapping() extensions = [ "sphinx.ext.autodoc", "sphinx.ext.intersphinx", "sphinx.ext.napoleon", "sphinx.ext.todo", "sphinx.ext.viewcode", ] source_suffix = ".rst" master_doc = "index" autoclass_content = "class" autodoc_member_order = "bysource" default_role = "py:obj" html_theme = "haiku" htmlhelp_basename = "{}doc".format(project) napoleon_use_rtype = False ================================================ FILE: src/neuronperf/model_neuron_b1.csv ================================================ n_models,workers_per_model,pipeline_size,batch_size,throughput_avg,throughput_peak,latency_ms_p0,latency_ms_p50,latency_ms_p90,latency_ms_p95,latency_ms_p99,latency_ms_p100,load_avg_ms,warmup_avg_ms,e2e_avg_ms,input_avg_ms,preprocess_avg_ms,postprocess_avg_ms,infer_avg_ms,worker_avg_s,total_infs,total_s,status,model_filename,multiprocess,multiinterpreter,device_type,instance_type 1,1,1,1,31346.0,31408.0,0.03,0.03,0.031,0.032,0.037,0.732,62.217,2.625,0.031,0.001,0.0,0.0,0.028,4.93,154704,5.0,finished,model_neuron_b1.pt,True,False,neuron,inf1.6xlarge 16,16,1,1,380604.75,380923.0,0.03,0.032,0.054,0.054,0.057,0.938,293.806,3.266,0.043,0.001,0.0,0.0,0.039,4.7,1799549,5.0,finished,model_neuron_b1.pt,True,False,neuron,inf1.6xlarge 1,2,1,1,51178.0,51319.0,0.035,0.036,0.037,0.039,0.047,1.13,114.118,2.713,0.037,0.001,0.0,0.0,0.033,4.88,248984,5.0,finished,model_neuron_b1.pt,True,False,neuron,inf1.6xlarge 16,32,1,1,381098.75,383905.0,0.03,0.058,0.067,0.073,0.121,48.07,303.916,4.42,0.08,0.001,0.0,0.0,0.074,4.69,1804925,5.0,finished,model_neuron_b1.pt,True,False,neuron,inf1.6xlarge ================================================ 
FILE: src/neuronperf/pyproject.toml ================================================ [tool.black] line-length = 100 [tool.isort] known_first_party = ["neuronperf"] [tool.pytest.ini_options] markers = [ "sanity", "slow", ] # required for compatibility with black: profile = "black" # To maintain consistency with other settings line_length = 100 ================================================ FILE: src/neuronperf/src/neuronperf/__init__.py ================================================ # -*- coding: utf-8 -*- """ NeuronPerf Library ~~~~~~~~~~~~~~~~~~ A library for benchmarking machine learning models on accelerators. :copyright: (c) 2022 Amazon Inc. :license: See LICENSE. """ from .__version__ import __title__, __description__, __url__, __version__ from .__version__ import __author__, __author_email__, __license__ from .__version__ import __copyright__ # setup logging first import logging _log_level = logging.DEBUG log = logging.getLogger(__name__) log.setLevel(_log_level) from .logging import _get_stream_handlers for handler in _get_stream_handlers(_log_level): log.addHandler(handler) from .benchmarking import compile, benchmark, set_verbosity from .cpu import cpu from .cpu.cpu import DummyModel from .reporting import CSV_COLS, PRINT_COLS, get_reports, print_reports, write_csv, write_json from .timing import timestamp_convert, Timer ================================================ FILE: src/neuronperf/src/neuronperf/__version__.py ================================================ __title__ = "neuronperf" __description__ = "A benchmarking library for machine learning accelerators." __url__ = "https://awsdocs-neuron.readthedocs-hosted.com/en/neuronperf" __version__ = "0.0.0.0" __author__ = "AWS" __author_email__ = "neuronperf@amazon.com" __license__ = "Proprietary" __copyright__ = "Copyright Amazon Web Services and its Affiliates. All rights reserved." ================================================ FILE: src/neuronperf/src/neuronperf/benchmarking.py ================================================ # -*- coding: utf-8 -*- """ neuronperf.benchmarking ~~~~~~~~~~~~~~~~~~~~~~~ Provides utility functions and classes that underlie the framework benchmarkers. """ from typing import Any, Callable, Dict, List, Union import collections import concurrent import concurrent.futures import copy import functools import logging import multiprocessing import os import psutil import subprocess import sys import tempfile import threading import time import traceback import dill from . import model_index from .compile_constants import NEURONCORE_PIPELINE_CORES, FAST_MATH, FAST_MATH_OPTIONS from .reporting import get_reports from .scripts import run_benchmark_file from .timing import Timer log = logging.getLogger(__name__) # Wrapper for sending back subprocess failure info. Needs to be at top level for pickle. BenchmarkerErrorWrapper = collections.namedtuple("BenchmarkerErrorWrapper", "trace") ERROR = "error" SUPPORTED_DEVICE_TYPES = ["neuron", "cpu", "cuda", "gpu"] # TODO: "tpu"] BENCHMARK_SECS = 120 class Benchmarker(threading.Thread): r""" :class:`benchmarking:Benchmarker` benchmarks a single model. This class is a `threading.Thread`. Call `start` to launch a non-blocking benchmarking thread. Calling `stop` will end the benchmarking and block until all subroutines complete. An object of this class may be serialized and sent to multiple subprocesses for parallel use. After benchmarking, results can be obtained with `results`. 
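
    Illustrative sketch (the `my_load_fn` and artifact name below are hypothetical;
    in normal use, `benchmark()` constructs and manages Benchmarkers internally)::

        benchmarker = Benchmarker(
            id=0,
            device_id=0,
            load_fn=my_load_fn,          # hypothetical: returns a loaded model
            model_filename="model.pt",   # hypothetical compiled artifact
            inputs=(example_input,),
            workers_per_model=2,
        )
        benchmarker.start()   # non-blocking; worker threads begin inferring
        time.sleep(30)        # benchmark window
        benchmarker.stop()    # signal workers and block until they finish
        results = benchmarker.results()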
""" def __init__( self, id: int, device_id: int, load_fn: Callable[[str], Any], model_filename: str, inputs, workers_per_model: int, env_setup_fn: Callable[[int, Dict, Any], None] = None, setup_fn: Callable[[int, Dict, Any], None] = None, preprocess_fn: Callable[[Any], Any] = None, postprocess_fn: Callable[[Any], Any] = None, dataset_loader_fn: Callable[[Any, int], Any] = None, model_class_name: str = None, model_class_file: str = None, ): super().__init__() self.id = id self.device_id = device_id self.load_fn = load_fn self.model_filename = model_filename self.inputs = inputs self.input_iter = None # Prepared in setup() self.input_lock = threading.Lock() self.workers_per_model = workers_per_model self.env_setup_fn = env_setup_fn self.setup_fn = setup_fn self.preprocess_fn = preprocess_fn self.postprocess_fn = postprocess_fn self.dataset_loader_fn = dataset_loader_fn self.model_class_name = model_class_name self.model_class_file = model_class_file # Mutable internal state. self.model = None self.benchmark_timer = Timer() self.env_setup_timer = Timer() self.setup_timer = Timer() self.load_timer = Timer() self.warmup_timer = Timer() self.input_timer = Timer() self.preprocess_timers = [Timer() for _ in range(workers_per_model)] self.infer_timers = [Timer() for _ in range(workers_per_model)] self.postprocess_timers = [Timer() for _ in range(workers_per_model)] self.e2e_timers = [Timer() for _ in range(workers_per_model)] self.worker_timers = [Timer() for _ in range(workers_per_model)] self.n_infs = [0] * workers_per_model self.process_id = 0 # set at launch time self.benchmarking = False self.benchmarking_lock = threading.Lock() self.status_lock = threading.Lock() self.status = "ready" self.error = None def _status(self, status, error=None): """Update internal status, unless a previous error has occurred.""" with self.status_lock: if self.status == ERROR: return self.status = status if error: self.error = error def next_input(self): self.input_lock.acquire() self.input_timer.start() try: return next(self.input_iter) finally: self.input_timer.stop() self.input_lock.release() def prepare_inputs(self): """Prepares input iterator; runs an optional custom setup function.""" if self.dataset_loader_fn: def input_iter(): dataset_loader = self.dataset_loader_fn(self.inputs, self.workers_per_model) while True: inputs = next(dataset_loader) yield inputs if isinstance(inputs, tuple) else (inputs,) self.input_iter = input_iter() else: def input_iter(): inputs = self.inputs if isinstance(self.inputs, tuple) else (self.inputs,) while True: yield inputs self.input_iter = input_iter() def load(self): """Loads the model that will be used for benchmarking.""" with self.load_timer: self.model = self.load_fn(self.model_filename, device_id=self.device_id) def warmup(self): """Warmup the model with a single e2e inference.""" with self.warmup_timer: inputs = self.next_input() if self.preprocess_fn: inputs = self.preprocess_fn(*inputs) outputs = self.model(*inputs if isinstance(inputs, tuple) else inputs) if self.postprocess_fn: self.postprocess_fn(outputs) self.n_infs[0] += 1 # track warmup infs in worker 0 def setup(self): """Perform all setup work prior to benchmarking.""" self.prepare_inputs() if self.env_setup_fn: with self.env_setup_timer: self.env_setup_fn() self.load() if self.setup_fn: with self.setup_timer: self.setup_fn(self.model) self.warmup() def infer(self, worker_id) -> tuple: """Execute a single inference.""" with self.e2e_timers[worker_id]: inputs = self.next_input() if self.preprocess_fn: with 
self.preprocess_timers[worker_id]: inputs = self.preprocess_fn(*inputs) with self.infer_timers[worker_id]: outputs = self.model(*inputs if isinstance(inputs, tuple) else inputs) if self.postprocess_fn: with self.postprocess_timers[worker_id]: outputs = self.postprocess_fn(outputs) return outputs def worker_thread(self, worker_id): """A single worker thread that runs inference until signalled to stop.""" n_infs = 0 try: log.debug(f"Benchmarker {self.id}, Worker {worker_id} started.") with self.worker_timers[worker_id]: while self.benchmarking and self.status != ERROR: self.infer(worker_id) n_infs += 1 if self.status == ERROR: log.debug( f"Benchmarker {self.id}, Worker {worker_id} stopped early due to an error after {n_infs} inferences." ) except StopIteration: pass except: trace = "".join(traceback.format_exception(*sys.exc_info())) log.error( f"Benchmarker {self.id}, Worker {worker_id} encountered an error during benchmarking:\n{trace}" ) self._status(ERROR, BenchmarkerErrorWrapper(trace)) finally: self.n_infs[worker_id] += n_infs log.debug( f"Benchmarker {self.id}, Worker {worker_id} finished after {self.n_infs[worker_id]} inferences." ) def run(self): with self.benchmarking_lock: if self.benchmarking: raise RuntimeError( f"Benchmarker {self.id} can't start because it is already running." ) self.benchmarking = True self._status("running") # Set our process id, now that we are launched. self.process_id = os.getpid() # Launch all workers and begin benchmarking. # If any individual worker reports an error, self.status will reflect # that after this method. with self.benchmark_timer: try: self.setup() except: trace = "".join(traceback.format_exception(*sys.exc_info())) log.error(f"Benchmarker {self.id} encountered an error during prep:\n{trace}") self._status(ERROR, BenchmarkerErrorWrapper(trace)) else: with concurrent.futures.ThreadPoolExecutor(max_workers=self.workers_per_model) as exe: for worker_id in range(self.workers_per_model): exe.submit(self.worker_thread, worker_id) # There are three ways to reach the next section: # 1. We ran out of benchmarking examples in a provided dataset (graceful quit on StopIteration). # 2. We were asked to stop(). # 3. We encountered an error. # In cases 1 and 3, we can acquire the lock, update our state if necessary, and quit. # In case 2, we already hold the lock, so we can skip this section and let stop() handle cleanup. if self.benchmarking_lock.acquire(blocking=False): try: self.benchmarking = False self._status("finished") finally: self.benchmarking_lock.release() def stop(self): # Setting self.benchmarking = False triggers workers to terminate gracefully. # We must hold the benchmarking_lock until the thread has joined to ensure # consistent use of the self.benchmarking flag. 
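        # (run() only tries to take this lock non-blocking on its way out; if stop()
        # already holds it, run() defers the final state transition to stop() below.)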
        with self.benchmarking_lock:
            if not self.benchmarking:
                return
            self._status("stopping")
            self.benchmarking = False
            self.join()
            self._status("finished")

    def results(self) -> dict:
        with self.benchmarking_lock:
            if self.benchmarking:
                raise RuntimeError("Cannot produce results until benchmarking has completed.")
            return {
                "id": self.id,
                "device_id": self.device_id,
                "workers_per_model": self.workers_per_model,
                "n_infs": sum(self.n_infs),
                "status": self.status,
                "process_id": self.process_id,
                "total_s": self.benchmark_timer.total_duration("s"),
                "timers": {
                    "env_setup": [self.env_setup_timer],
                    "setup": [self.setup_timer],
                    "load": [self.load_timer],
                    "input": [self.input_timer],
                    "warmup": [self.warmup_timer],
                    "preprocess": self.preprocess_timers,
                    "infer": self.infer_timers,
                    "postprocess": self.postprocess_timers,
                    "e2e": self.e2e_timers,
                    "worker": self.worker_timers,
                },
            }


class StatsThread(threading.Thread):
    """A thread to collect some system metrics during benchmarking."""

    def __init__(self, interval: float):
        super().__init__()
        self.interval = interval  # interval (in seconds) to collect metrics
        self.cpu_percents = []
        self.mem_percents = []
        self.running = True

    def run(self):
        while self.running:
            cpu_percent = psutil.cpu_percent(interval=self.interval, percpu=False)
            mem_percent = psutil.virtual_memory()[2]  # index 2 is the used-memory percentage
            self.cpu_percents.append(cpu_percent)
            self.mem_percents.append(mem_percent)

    def join(self, **kwargs):
        self.running = False
        super().join(**kwargs)


def _combine_results(results: List[dict]) -> dict:
    """Combines the results of multiple benchmarkers into a single results structure."""
    combined_results = {}
    for result in results:
        # workers_per_model should be the same across all benchmarkers, so we only need it once.
        combined_results.setdefault("workers_per_model", result["workers_per_model"])
        # If an error occurred anywhere, preserve it.
        combined_results["status"] = (
            result["status"] if combined_results.get("status", "") != ERROR else ERROR
        )
        combined_results["n_infs"] = combined_results.get("n_infs", 0) + result["n_infs"]
        # Keep the longest subprocess duration.
        combined_results["total_s"] = max(combined_results.get("total_s", 0), result["total_s"])
        # Concatenate all timing info.
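        # Every per-benchmarker Timer is appended to one flat list per phase
        # (load, infer, e2e, ...), so reporting can aggregate across processes.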
        timers = combined_results.get("timers", {})
        for k, v in result["timers"].items():
            timer_list = timers.get(k, [])
            timer_list.extend(v)
            timers[k] = timer_list
        combined_results["timers"] = timers
    return combined_results


def _get_num_workers(pipeline_size: int) -> int:
    """Returns a best-guess number of worker threads for a single benchmarking process."""
    return 2 if pipeline_size == 1 else pipeline_size - 1


def get_instance_type() -> str:
    """Try to determine the EC2 instance type from instance metadata."""
    try:
        import urllib.request

        with urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-type"
        ) as response:
            instance_type = response.read().decode("utf-8")
            log.debug("Automatically determined instance type: {}".format(instance_type))
            return instance_type
    except:
        return None


def _get_cost_per_hour(instance_type: str) -> float:
    # Hourly rates
    instancetype_to_cost = {
        "inf1.xlarge": 0.228,
        "inf1.2xlarge": 0.362,
        "inf1.6xlarge": 1.18,
        "inf1.24xlarge": 4.721,
    }
    try:
        return instancetype_to_cost[instance_type]
    except:
        # Just ignore unknown instance types for now
        return None


def _get_max_neuroncores(instance_type: str = None) -> int:
    """Try to obtain the maximum number of NeuronCores available on this instance."""
    instancetype_to_neuroncores = {
        "inf1.xlarge": 4,
        "inf1.2xlarge": 4,
        "inf1.6xlarge": 16,
        "inf1.24xlarge": 64,
    }
    try:
        if not instance_type:
            instance_type = get_instance_type()
        return instancetype_to_neuroncores[instance_type]
    except:
        num_cores = 2
        log.warning(f"Unknown Neuron device size. Assuming {num_cores} NeuronCores is the maximum.")
        return num_cores


def _get_num_gpus(instance_type: str = None) -> int:
    """Try to obtain the number of GPUs available on this instance."""
    instancetype_to_gpus = {
        "g4dn.xlarge": 1,
        "g4dn.2xlarge": 1,
        "g4dn.4xlarge": 1,
        "g4dn.8xlarge": 1,
        "g4dn.16xlarge": 1,
        "g4dn.12xlarge": 4,
        "g4dn.metal": 8,
        "g4ad.xlarge": 1,
        "g4ad.2xlarge": 1,
        "g4ad.4xlarge": 1,
        "g4ad.8xlarge": 2,
        "g4ad.16xlarge": 4,
        "p4d.24xlarge": 8,
    }
    try:
        if not instance_type:
            instance_type = get_instance_type()
        return instancetype_to_gpus[instance_type]
    except:
        log.warning("Unknown GPU device size. Assuming 1 GPU is available.")
        return 1


def _get_num_devices(device_type: str, instance_type: str = None) -> int:
    """Dispatch to a device-specific count; to be populated later for other instance types."""
    if device_type == "neuron":
        return _get_max_neuroncores(instance_type)
    elif device_type == "cpu":
        return multiprocessing.cpu_count()
    elif device_type == "cuda" or device_type == "gpu":
        return _get_num_gpus(instance_type)
    else:
        log.warning("An unknown device_type was passed: {}".format(device_type))
        return None


def _sanitize_inputs(inputs, batch_sizes: Union[int, List[int]], dataset_inputs=False) -> tuple:
    """Return inputs and batch_sizes with matching lengths, or throw an error."""
    if not isinstance(inputs, list):
        inputs = [inputs]
    if isinstance(batch_sizes, int):
        batch_sizes = [batch_sizes]
    if not batch_sizes:
        log.warning(
            "Batch sizes were not provided, so assuming 1 and only the first input will be benchmarked."
        )
        batch_sizes = [1]
    if not dataset_inputs:
        if len(batch_sizes) < len(inputs):
            delta = len(inputs) - len(batch_sizes)
            log.warning(
                "Received {} inputs, but only {} batch sizes. Discarding last {} inputs.".format(
                    len(inputs), len(batch_sizes), delta
                )
            )
            inputs = inputs[: len(batch_sizes)]
        elif len(inputs) < len(batch_sizes):
            delta = len(batch_sizes) - len(inputs)
            log.warning(
                "Received {} batch sizes, but only {} inputs. Discarding last {} batch sizes.".format(
                    len(batch_sizes), len(inputs), delta
                )
            )
            batch_sizes = batch_sizes[: len(inputs)]
    return inputs, batch_sizes


def set_verbosity(verbosity: int):
    r"""
    Controls the verbosity of NeuronPerf logging.

    :param int verbosity: 0 = error, 1 = info, 2 = debug
    """
    if 0 == verbosity:
        log.setLevel(logging.ERROR)
    elif 1 == verbosity:
        log.setLevel(logging.INFO)
    else:
        log.setLevel(logging.DEBUG)


def compile(
    compile_fn,
    model,
    inputs,
    batch_sizes: Union[int, List[int]] = None,
    pipeline_sizes: Union[int, List[int]] = None,
    performance_levels: Union[str, List[int]] = None,
    models_dir: str = "models",
    model_name: str = None,
    filename: str = None,
    compiler_args: dict = None,
    verbosity: int = 1,
    **kwargs,
) -> str:
    r"""
    Compiles the provided model with each provided example input, pipeline size, and performance level.

    :param model: The model to compile.
    :param list inputs: A list of example inputs.
    :param Union[int, List[int]] batch_sizes: A list of batch sizes that correspond to the example inputs.
    :param Union[int, List[int]] pipeline_sizes: A list of pipeline sizes to use. See :ref:`neuroncore-pipeline`.
    :param Union[int, List[int]] performance_levels: A list of performance levels to try. Options are: 0 (max accuracy), 1, 2, 3 (max performance, default). See :ref:`mixed-precision`.
    :param str models_dir: The directory where compilation artifacts will be stored.
    :param str model_name: An optional model name tag to apply to compiled artifacts.
    :param str filename: The name of the model index to write out. If not provided, a name will be generated and returned.
    :param dict compiler_args: Additional compiler arguments to be forwarded with every compilation.
    :param int verbosity: 0 = error, 1 = info, 2 = debug
    :return: A model index filename. If a configuration fails to compile, it will not be included in the index and an error will be logged.
    :rtype: str
    """
    # Set NeuronPerf logging verbosity.
    set_verbosity(verbosity)

    # Standardize arguments.
    if not pipeline_sizes:
        pipeline_sizes = [1]
    if not performance_levels:
        performance_levels = []
    if not compiler_args:
        compiler_args = {}
    if not model_name:
        if isinstance(model, str):
            model_name = model
        else:
            try:
                model_name = model.__name__
            except AttributeError:
                log.warning("Unable to determine a model name, using 'Model'.")
                model_name = "Model"
    if isinstance(pipeline_sizes, int):
        pipeline_sizes = [pipeline_sizes]
    if isinstance(performance_levels, int):
        performance_levels = [performance_levels]
    inputs, batch_sizes = _sanitize_inputs(inputs, batch_sizes)

    # Sanity check and sanitize compiler_args.
    if NEURONCORE_PIPELINE_CORES in compiler_args:
        if pipeline_sizes:
            log.warning(
                (
                    "You provided NeuronCore Pipeline Core sizes using both "
                    "compiler_args and pipeline_sizes. Ignoring flag in compiler_args."
                )
            )
        else:
            pipeline_sizes = [compiler_args[NEURONCORE_PIPELINE_CORES]]
        del compiler_args[NEURONCORE_PIPELINE_CORES]
    if FAST_MATH in compiler_args:
        if performance_levels:
            log.warning(
                (
                    f"You provided performance_levels and {FAST_MATH}. "
                    "Ignoring flag in compiler_args."
                )
            )
        del compiler_args[FAST_MATH]

    # Check if performance levels are within expected bounds.
    max_performance = max(FAST_MATH_OPTIONS)
    performance_levels_invalid = list(
        filter(
            lambda level: level < min(FAST_MATH_OPTIONS) or level > max_performance,
            performance_levels,
        )
    )
    if performance_levels_invalid:
        log.warning(
            "You provided some invalid performance_levels. Ignoring: {}".format(
                performance_levels_invalid
            )
        )
        performance_levels = [
            level for level in performance_levels if level not in performance_levels_invalid
        ]
    # If we still have no values, set default to max performance.
    if not performance_levels:
        performance_levels.append(max_performance)

    # Create standard output dir, if it doesn't exist.
    os.makedirs(models_dir, exist_ok=True)

    # Compile all requested model combinations.
    model_idxs = []

    # TODO: Support appending to existing index by filtering already-compiled configs.
    def make_index():
        """Create a model index file that contains info about all compiled models."""
        index = model_index.append(*model_idxs)
        # Return the name of the new index file.
        return model_index.save(index, filename=filename)

    compile_idx = 1
    n_compiles = len(inputs) * len(pipeline_sizes) * len(performance_levels)
    for input_idx, example_input in enumerate(inputs):
        batch_size = batch_sizes[input_idx]
        for pipeline_size in pipeline_sizes:
            for performance_level in performance_levels:
                _compiler_args = copy.copy(compiler_args)
                _compiler_args[FAST_MATH] = FAST_MATH_OPTIONS[performance_level]
                if pipeline_size != 1:
                    _compiler_args[NEURONCORE_PIPELINE_CORES] = str(pipeline_size)
                # Construct a more informative model name with some config info
                model_name_ex = "{}_b{}_p{}_{}".format(
                    model_name,
                    batch_size,
                    pipeline_size,
                    model_index.generate_id(),
                )
                log.info(
                    (
                        f"Compiling batch size {batch_size} for {pipeline_size} NeuronCore(s) with performance level "
                        f"{performance_level}/{max_performance}. [{compile_idx}/{n_compiles}]"
                    )
                )
                status = "ready"
                timer = Timer()
                with timer:
                    try:
                        model_filename = compile_fn(
                            model,
                            example_input,
                            models_dir,
                            model_name_ex,
                            compiler_args=_compiler_args,
                            **kwargs,
                        )
                        status = "finished"
                    except KeyboardInterrupt:
                        status = "error"
                        model_filename = None
                        log.error("Compilation interrupted, terminating.")
                        return make_index()
                    except:
                        status = "error"
                        model_filename = None
                        log.exception(
                            (
                                f"Failed to compile input={input_idx}, "
                                f"batch_size={batch_size}, "
                                f"pipeline_size={pipeline_size}, "
                                f"performance_level={performance_level}."
                            )
                        )
                    finally:
                        model_idx = model_index.create(
                            model_filename,
                            model_name=model_name,
                            batch_size=batch_size,
                            pipeline_size=pipeline_size,
                            performance_level=performance_level,
                            compile_s=round(timer.total_duration("s"), 2),
                            status=status,
                        )
                        model_idxs.append(model_idx)
                filename = make_index()
                compile_idx += 1
    return filename


def run_benchmarker(benchmarker, duration, pipe=None):
    def _send(results):
        if pipe:
            pipe.send(results)
            pipe.close()
        else:
            return results

    try:
        log.debug(f"Benchmarker {benchmarker.id} started.")
        check_freq = 0.1  # Check progress every 0.1 seconds.
        start_time = time.time()
        benchmarker.start()
        elapsed = 0
        while (elapsed < duration) and benchmarker.benchmarking:
            elapsed = time.time() - start_time
            remaining = max(0, duration - elapsed)
            time.sleep(min(check_freq, remaining))
        benchmarker.stop()
    except:
        trace = "".join(traceback.format_exception(*sys.exc_info()))
        error = BenchmarkerErrorWrapper(trace)
        return _send(error)
    else:
        results = benchmarker.results() if benchmarker.status != ERROR else benchmarker.error
        return _send(results)
    finally:
        log.debug(f"Benchmarker {benchmarker.id} finished.")


def _run_benchmarker_new_interpreter(benchmarker, duration):
    """
    This function is a workaround for frameworks that cannot be safely forked.
    The premise is to launch a new Python interpreter and run benchmarking
    from within the new interpreter.
    It works by writing serialized benchmarkers to temporary files, and then
    launching run_benchmark_file.py. The script writes back serialized results.
    """
    # Temporary serialization workaround. This attribute is inherited from Thread.
    # TODO: Separate data from benchmarking.
    setattr(benchmarker, "_stderr", None)

    script = run_benchmark_file.__file__

    # Serialize the benchmarker to a file.
    f = tempfile.NamedTemporaryFile(delete=False)
    log.debug("Dumping Benchmarker {} to file '{}'.".format(benchmarker.id, f.name))
    try:
        dill.dump(benchmarker, f)
    except dill.PicklingError:
        raise dill.PicklingError(
            (
                "NeuronPerf was unable to serialize the benchmarker. This is probably because your model "
                "could not be serialized. Make sure to use top-level classes instead of locals. You may "
                "need to wrap your model and manually load it using Python's importlib."
            )
        )
    f.close()

    # Run the benchmarking script in a clean Python process.
    command = [
        sys.executable,
        script,
        f.name,
        str(duration),
    ]
    # If we are manually loading a model class file in subprocesses, we need to let them know.
    if benchmarker.model_class_name and benchmarker.model_class_file:
        command.append(f"--model_class_name={benchmarker.model_class_name}")
        command.append(f"--model_class_file={benchmarker.model_class_file}")
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding="utf-8"
    )

    # Interpreter and framework overhead add a delay to processing. We should ensure
    # that during multiinterpreter benchmarking, sufficient time is waited for results.
    timeout = 60 + duration
    try:
        outs, errs = proc.communicate(timeout=timeout)
        with open(f.name, "rb") as fp:
            result = dill.load(fp)
        if isinstance(result, BenchmarkerErrorWrapper):
            raise ChildProcessError(
                "Benchmarker {} encountered an error:\n{}".format(benchmarker.id, result.trace)
            )
        if isinstance(result, Benchmarker):
            # If we still have a benchmarker object instead of results, something
            # went wrong that wasn't handled by the benchmarker routine.
            from pathlib import Path

            path = Path(f.name)
            logs = os.path.join(path.parent, "neuronperf_error_{}".format(str(path.stem)))
            if os.path.exists(logs):
                with open(logs, "rt") as logs_fp:
                    err_logs = logs_fp.readlines()
                os.unlink(logs)
                raise ChildProcessError(
                    "Benchmarker {} failed. Logs from child process:\n{}".format(
                        benchmarker.id, "".join(err_logs)
                    )
                )
            else:
                raise ChildProcessError(
                    (
                        "Benchmarker {} failed and no error logs were found. A child process may have "
                        "aborted. To obtain a stack trace, try running a single configuration inside a "
                        "single process by passing multiprocess=False, multiinterpreter=False"
                    ).format(benchmarker.id)
                )
        return result
    except subprocess.TimeoutExpired:
        proc.kill()
        raise ChildProcessError(
            "Benchmarker {} stopped responding after {} seconds.".format(benchmarker.id, timeout)
        )
    finally:
        os.unlink(f.name)


def _run_benchmarkers_multiprocess(
    benchmarkers: List[Benchmarker], duration: int, benchmark_func=run_benchmarker
) -> dict:
    results = []
    # Hand each benchmarker object to a subprocess.
    pipes, procs = [], []
    for benchmarker in benchmarkers:
        parent_pipe, child_pipe = multiprocessing.Pipe()
        pipes.append(parent_pipe)
        proc = multiprocessing.Process(
            target=benchmark_func, args=(benchmarker, duration, child_pipe)
        )
        procs.append(proc)
    # Launch benchmarking.
    for proc in procs:
        proc.start()
    # Collect results.
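    # pipe.recv() below blocks until each child sends back either a results dict
    # or a BenchmarkerErrorWrapper carrying the formatted traceback.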
for id, (pipe, proc) in enumerate(zip(pipes, procs)): try: proc_result = pipe.recv() if isinstance(proc_result, BenchmarkerErrorWrapper): log.error("Child process encountered an error:\n{}".format(proc_result.trace)) raise ChildProcessError() proc.join() results.append(proc_result) except KeyboardInterrupt: log.error("Benchmarking interrupted, terminating.") for proc in procs: proc.terminate() raise KeyboardInterrupt() except EOFError: log.error( ( f"Child process {id} was killed by the host OS during benchmarking.\n" "You may have run out of memory.\n" "Verify that your model can perform inference without NeuronPerf or try n_models=1." ) ) return _combine_results(results) def _run_benchmarkers_multithreaded( benchmarkers: List[Benchmarker], duration: int, benchmark_func=run_benchmarker ) -> dict: results = [] timeout = 60 + duration # Add some time for setup overhead and cleanup. try: args = ((benchmarker, duration) for benchmarker in benchmarkers) with concurrent.futures.ThreadPoolExecutor(max_workers=len(benchmarkers)) as exe: results.extend(exe.map(lambda arg: benchmark_func(*arg), args, timeout=timeout)) for result in results: if isinstance(result, BenchmarkerErrorWrapper): raise RuntimeError("Worker thread encountered an error:\n{}".format(result.trace)) except concurrent.futures.TimeoutError: log.error("Benchmarking timed out after {} seconds.".format(timeout)) except KeyboardInterrupt: raise KeyboardInterrupt("Benchmarking interrupted, terminating.") return _combine_results(results) def run_benchmarkers( benchmarkers: List[Benchmarker], duration: int, stats_interval: float = 0.5, multiprocess: bool = True, multiinterpreter: bool = False, ) -> dict: results = {} # Launch a background thread to collect system stats during benchmarking. stats_thread = StatsThread(stats_interval) stats_thread.start() try: if multiinterpreter: if not sys.executable: raise ValueError( ( "Unable to benchmark in multi-interpreter mode because " "the Python interpreter cannot be located (sys.executable is empty)." ) ) # We can safely re-use the multithreaded path here by using a custom benchmarking # function that spawns fresh interpreters. results = _run_benchmarkers_multithreaded( benchmarkers, duration, benchmark_func=_run_benchmarker_new_interpreter ) elif multiprocess: results = _run_benchmarkers_multiprocess(benchmarkers, duration) else: results = _run_benchmarkers_multithreaded(benchmarkers, duration) finally: stats_thread.join() results["cpu_percents"] = stats_thread.cpu_percents results["mem_percents"] = stats_thread.mem_percents return results def _get_env_setup_fn(benchmarker_id: int, benchmarker_config: dict, env_setup_fn): """Wrap an environment setup function with device-specific requirements.""" device_type = str(benchmarker_config["device_type"]).lower().strip() legacy = bool(os.environ.get("NEURONCORE_GROUP_SIZES")) if "neuron" == device_type: @functools.wraps(env_setup_fn) def _env_setup_fn(): import os id = benchmarker_id config = benchmarker_config pipeline_size = config["pipeline_size"] if config["multiprocess"] or config["multiinterpreter"]: # In multiprocess mode, need to specify the exact cores for the process. min_core = pipeline_size * id max_core = min_core + (pipeline_size - 1) visible_cores = f"{min_core}-{max_core}" if legacy: os.environ["NEURONCORE_GROUP_SIZES"] = str(pipeline_size) else: os.environ["NEURON_RT_VISIBLE_CORES"] = visible_cores else: # In multithreaded mode, all required cores are allocated in this process. 
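                # e.g. n_models=4 with pipeline_size=2 yields NEURON_RT_VISIBLE_CORES="0-7".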
                n_models = config["n_models"]
                if legacy:
                    os.environ["NEURONCORE_GROUP_SIZES"] = ",".join([str(pipeline_size)] * n_models)
                else:
                    os.environ["NEURON_RT_VISIBLE_CORES"] = "0-{}".format(
                        n_models * pipeline_size - 1
                    )
            # Finally, call any additional custom setup function provided.
            if env_setup_fn:
                env_setup_fn(id, config)

        return _env_setup_fn
    elif device_type == "cpu":
        return env_setup_fn
    elif device_type == "cuda" or device_type == "gpu":

        @functools.wraps(env_setup_fn)
        def _env_setup_fn():
            import os

            os.environ["CUDA_VISIBLE_DEVICES"] = str(benchmarker_id)
            if env_setup_fn:
                env_setup_fn(benchmarker_id, benchmarker_config)

        return _env_setup_fn
    else:
        log.warning(
            (
                f"NeuronPerf does not implement a proper environment setup for {device_type}. "
                "You may need to provide your own."
            )
        )
        return env_setup_fn


def _get_setup_fn(benchmarker_id: int, benchmarker_config: dict, setup_fn):
    """Wraps a customer-provided setup function with additional info from the benchmarker."""
    if not setup_fn:
        return None

    @functools.wraps(setup_fn)
    def _setup_fn(model):
        setup_fn(benchmarker_id, benchmarker_config, model)

    return _setup_fn


def _get_device_id(benchmarker_id: int, benchmarker_config: dict):
    """Calculate an appropriate device id for a benchmarker object."""
    device_id = benchmarker_id
    device_type = str(benchmarker_config["device_type"]).lower().strip()
    if device_type in SUPPORTED_DEVICE_TYPES:
        if not (benchmarker_config["multiprocess"] or benchmarker_config["multiinterpreter"]):
            device_id = benchmarker_id * benchmarker_config["pipeline_size"]
        return device_id
    else:
        log.warning(
            "Assuming device_id={} for benchmarker_id={} for unknown device_type={}".format(
                device_id, benchmarker_id, device_type
            )
        )
        return device_id


def benchmark(
    load_fn: Callable[[str, int], Any],
    model_filename: str,
    inputs: Any,
    batch_sizes: Union[int, List[int]] = None,
    duration: float = BENCHMARK_SECS,
    n_models: Union[int, List[int]] = None,
    pipeline_sizes: Union[int, List[int]] = None,
    performance_levels: Union[int, List[int]] = None,
    workers_per_model: Union[int, None] = None,
    env_setup_fn: Callable[[int, Dict], None] = None,
    setup_fn: Callable[[int, Dict, Any], None] = None,
    preprocess_fn: Callable[[Any], Any] = None,
    postprocess_fn: Callable[[Any], Any] = None,
    dataset_loader_fn: Callable[[Any, int], Any] = None,
    multiprocess: bool = True,
    multiinterpreter: bool = False,
    return_timers: bool = False,
    stats_interval: float = 0.5,
    device_type: str = "neuron",
    cost_per_hour: float = None,
    model_name: str = None,
    model_class_name: str = None,
    model_class_file: str = None,
    verbosity: int = 1,
) -> List[Dict]:
    r"""
    Benchmarks the model index or individual model using the provided inputs.
    If a model index is provided, additional fields such as ``pipeline_sizes`` and
    ``performance_levels`` can be used to filter the models to benchmark. The default
    behavior is to benchmark all configurations in the model index.
    Any additional compiler_args passed will be forwarded to the compiler on every invocation.

    :param Callable[[str, int], Any] load_fn: A function that accepts a model filename and device id, and returns a loaded model. This is automatically passed through the subpackage calls (e.g. ``neuronperf.torch.benchmark``).
    :param str model_filename: A path to a model index from compile or path to an individual model. For CPU benchmarking, a class should be passed that can be instantiated with a default constructor (e.g. ``MyModelClass``).
    :param list inputs: A list of example inputs. If the list contains tuples, they will be destructured on inference to support multiple arguments.
    :param Union[int, List[int]] batch_sizes: A list of ints indicating batch sizes that correspond to the inputs. Assumes 1 if not provided.
    :param float duration: The number of seconds to benchmark each model.
    :param Union[int, List[int]] n_models: The number of models to run in parallel. Default behavior runs 1 model and the max number of models possible, determined by a best effort from ``device_type``, instance size, or other environment state.
    :param Union[int, List[int]] pipeline_sizes: A list of pipeline sizes to use. See :ref:`neuroncore-pipeline`.
    :param Union[int, List[int]] performance_levels: A list of performance levels to try. Options are: 0 (max accuracy), 1, 2, 3 (max performance, default). See :ref:`mixed-precision`.
    :param Union[int, List[int]] workers_per_model: The number of workers to use per model loaded. If ``None``, this is automatically selected.
    :param Callable[[int, Dict], None] env_setup_fn: A custom environment setup function to run in each subprocess before model loading. It will receive the benchmarker id and config.
    :param Callable[[int, Dict, Any], None] setup_fn: A function that receives the benchmarker id, config, and model to perform last minute configuration before inference.
    :param Callable[[Any], Any] preprocess_fn: A custom preprocessing function to perform on each input before inference.
    :param Callable[[Any], Any] postprocess_fn: A custom postprocessing function to perform on each input after inference.
    :param bool multiprocess: When True, model loading is dispatched to forked subprocesses. Should be left alone unless debugging.
    :param bool multiinterpreter: When True, benchmarking is performed in a new python interpreter per model. All parameters must be serializable. Overrides multiprocess.
    :param bool return_timers: When True, the return of this function is a list of tuples ``(config, results)`` with detailed information. This can be converted to reports with ``get_reports(results)``.
    :param float stats_interval: Collection interval (in seconds) for metrics during benchmarking, such as CPU and memory usage.
    :param str device_type: This will be set automatically to one of the ``SUPPORTED_DEVICE_TYPES``.
    :param float cost_per_hour: The price of this device / hour. Used to estimate cost / 1 million infs in reports.
    :param str model_name: A friendly name for the model to use in reports.
    :param str model_class_name: Internal use.
    :param str model_class_file: Internal use.
    :param int verbosity: 0 = error, 1 = info, 2 = debug
    :return: A list of benchmarking results.
    :rtype: List[Dict]
    """
    # Set NeuronPerf logging verbosity.
    set_verbosity(verbosity)

    # --------------------------------------------
    # Input validation
    # --------------------------------------------
    # Validate that enough information was provided.
    if not load_fn:
        raise ValueError(
            "You should call benchmark() through a framework submodule, e.g. neuronperf.torch.benchmark()."
        )
    if not isinstance(model_filename, str):
        raise ValueError(
            "You must provide the path to a saved model or the path to a model index from neuronperf.compile()."
        )

    # Useful for debugging.
    if not multiprocess and not multiinterpreter:
        log.warning("Benchmarking in a single process.")

    # Standardize inputs.
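    # Scalar arguments are promoted to single-element lists below so that every
    # (batch size, n_models, pipeline size, ...) combination can be iterated uniformly.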
dataset_inputs = dataset_loader_fn is not None if (not dataset_inputs) and (not isinstance(inputs, list)): inputs = [inputs] if isinstance(n_models, int): n_models = [n_models] if isinstance(pipeline_sizes, int): pipeline_sizes = [pipeline_sizes] if isinstance(performance_levels, int): performance_levels = [performance_levels] if workers_per_model is None: workers_per_model = [] elif isinstance(workers_per_model, int): workers_per_model = [workers_per_model] if duration < BENCHMARK_SECS: log.warning("Results may be unreliable with short test durations.") # If the model_filename is JSON, attempt to interpret it as a model index. index = None if model_filename.endswith(model_index.MODEL_INDEX_SUFFIX): index = model_index.load(model_filename) # If we loaded a model_index, ensure provided inputs are compatible # and use it to refine the benchmarking combinations we will run. if index: # Extract a model name from the index, if possible. if not model_name: model_name = index["model_name"] # If batch_sizes, pipeline_sizes and/or performance_levels were provided, # treat them as filters on the index. A value of None is treated as no filter. # See the docs for model_index.filter(). index = model_index.filter( index, status="finished", # only take compiled models batch_size=batch_sizes, # select all requested batch sizes pipeline_size=pipeline_sizes, performance_level=performance_levels, ) if 0 == len(index["model_configs"]): raise ValueError( "No models were found in the model index matching requested criteria. Check that compilation succeeded." ) # If a model index was provided without batch_sizes, extract the sizes from the index. if not batch_sizes: # Select unique batch_sizes in model index. batch_sizes = set(config["batch_size"] for config in index["model_configs"]) batch_sizes = sorted(list(batch_sizes)) # Validate batch sizes after attempting to extract from the model index. inputs, batch_sizes = _sanitize_inputs(inputs, batch_sizes, dataset_inputs) # If we still don't have a model name, use the filename. if not model_name: model_name = model_filename # If no pipeline_sizes are provided, we'll assume it's 1 for a single model unless told otherwise. if not pipeline_sizes: log.debug("Pipeline size was not specified, assuming 1.") pipeline_sizes = [1] # Assume max performance is desired. if not performance_levels: max_performance = max(FAST_MATH_OPTIONS) log.debug(f"Performance level was not specified, assuming {max_performance}.") performance_levels = [max_performance] # If a model was provided directly without a model index, build a dummy model index. # A single model can not possibly have been compiled for more than 1 configuration, # hence why we can assume index [0]. if not index: index = model_index.create( filename=model_filename, model_name=model_name, batch_size=batch_sizes[0], pipeline_size=pipeline_sizes[0], performance_level=performance_levels[0], ) model_configs = index["model_configs"] # -------------------------------------------- # Benchmarking # -------------------------------------------- # Estimate time remaining based on configs requested to run. # If n_models wasn't provided, the default benchmarks [min, max]. n_models_est = 2 if not n_models else len(n_models) # If workers_per_model wasn't provided, the default benchmarks [1, 2]. 
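    # (Each model config runs once per n_models value and once per workers_per_model
    # value, for `duration` seconds each; the product gives the rough total below.)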
n_models_est *= 2 if not workers_per_model else len(workers_per_model) secs_remaining = len(model_configs) * n_models_est * duration mins_remaining = None if secs_remaining < 60 else round(secs_remaining / 60.0, 1) etr = f"{mins_remaining} minutes" if mins_remaining else f"{int(round(secs_remaining))} seconds" log.info("Benchmarking '{}', ~{} remaining.".format(model_filename, etr)) # Try to determine instance type. instance_type = get_instance_type() if not instance_type: instance_type = "unknown" # Try to automatically determine the maximum number of devices available. max_devices = _get_num_devices(device_type, instance_type) log.debug("Automatically determined number of devices: {}".format(max_devices)) # Try to detect cost / hour for this device. if not cost_per_hour: cost_per_hour = _get_cost_per_hour(instance_type) # Run through all requested combinations and generate a report. # This will produce a list of tuples, (config, results). all_results = [] def make_reports(): """Helper to generate reports from available results.""" # If all_results was set, we return the unmodified benchmarking results. return all_results if return_timers else get_reports(all_results, cost_per_hour) for model_config in model_configs: batch_size = model_config["batch_size"] pipeline_size = model_config["pipeline_size"] # Determine the number of model copies for each benchmarking session. model_counts = n_models # If the user didn't provide n_models, choose reasonable defaults. if not model_counts: # Try to run a single model and the max models supported on this hardware. if max_devices and (max_devices // pipeline_size > 1): model_counts = [1, max_devices // pipeline_size] else: model_counts = [1] # If the user provided model counts and we determine they are too large, emit a warning. else: if max_devices: model_counts_too_large = list( filter( lambda model_count: model_count * pipeline_size > max_devices, model_counts ) ) if model_counts_too_large: log.warning( ( "Some values of n_models exceed the number of devices available: " f"{model_counts_too_large} > {max_devices}" ) ) # Compute number of workers for this pipeline size, if not specified. n_workers = workers_per_model if not n_workers: n_workers = [_get_num_workers(pipeline_size)] # 1 worker thread == min latency if 1 not in n_workers: n_workers.insert(0, 1) for _workers_per_model in n_workers: # We now know everything we need to benchmark. # 1. Build a comprehensive benchmarker config, # 2. build one benchmarker per model, # 3. run the benchmarkers in parallel, # 4. and collect the results for this configuration. for model_count in model_counts: # 1. Benchmarker config config = { "model_filename": model_config["filename"], "model_name": model_name, "device_type": device_type, "instance_type": instance_type, "batch_size": batch_size, "n_models": model_count, "workers_per_model": _workers_per_model, "pipeline_size": pipeline_size, "n_devices": model_count * pipeline_size, "performance_level": model_config["performance_level"], "multiprocess": multiprocess, "multiinterpreter": multiinterpreter, "stats_interval": str(stats_interval), "start_dts": time.strftime("%Y%m%d-%H%M%S"), "duration": str(duration), } # 2. 
                # 2. Build the benchmarkers
                benchmarkers = []
                for benchmarker_id in range(model_count):
                    benchmarker = Benchmarker(
                        id=benchmarker_id,
                        device_id=_get_device_id(benchmarker_id, config),
                        load_fn=load_fn,
                        model_filename=model_config["filename"],
                        inputs=inputs if dataset_inputs else inputs[batch_sizes.index(batch_size)],
                        workers_per_model=_workers_per_model,
                        env_setup_fn=_get_env_setup_fn(benchmarker_id, config, env_setup_fn),
                        setup_fn=_get_setup_fn(benchmarker_id, config, setup_fn),
                        preprocess_fn=preprocess_fn,
                        postprocess_fn=postprocess_fn,
                        dataset_loader_fn=dataset_loader_fn,
                        model_class_name=model_class_name,
                        model_class_file=model_class_file,
                    )
                    benchmarkers.append(benchmarker)

                # 3. Run benchmarkers in parallel
                log.debug("Running model config: {}".format(config))
                try:
                    results = run_benchmarkers(
                        benchmarkers,
                        duration,
                        stats_interval=stats_interval,
                        multiprocess=multiprocess,
                        multiinterpreter=multiinterpreter,
                    )
                    # 4. Collect results
                    config["stop_dts"] = time.strftime("%Y%m%d-%H%M%S")
                    all_results.append((config, results))
                except KeyboardInterrupt:
                    # If we are interrupted, return whatever we have on hand.
                    return make_reports()
                except Exception:
                    # If something else goes wrong with the model, we should
                    # log this configuration and move on.
                    log.exception("Failure benchmarking config: {}".format(config))

    return make_reports()


================================================
FILE: src/neuronperf/src/neuronperf/compile_constants.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.compile_constants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Holds constants used at compile time.
"""

NEURONCORE_PIPELINE_CORES = "--neuroncore-pipeline-cores"

FAST_MATH = "--fast-math"
FAST_MATH_OPTIONS = {
    0: "none",
    1: "fp32-cast-matmult no-fast-relayout",
    2: "fp32-cast-matmult",
    3: "all",
}


================================================
FILE: src/neuronperf/src/neuronperf/cpu/__init__.py
================================================
from neuronperf.cpu.cpu import benchmark


================================================
FILE: src/neuronperf/src/neuronperf/cpu/cpu.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.cpu
~~~~~~~~~~~~~~

Provides CPU support.
"""

import functools
import logging

from .. import benchmarking

log = logging.getLogger(__name__)


class DummyModel:
    def __call__(self, x):
        x *= 5
        x += 3
        return x


def benchmark(model_class, inputs, *args, **kwargs):
    if not isinstance(model_class, type):
        raise TypeError("For CPU benchmarking, you must provide a class to instantiate.")

    device_type = kwargs.pop("device_type", "cpu")
    multiinterpreter = kwargs.pop("multiinterpreter", False)
    if multiinterpreter:
        log.warning(
            "CPU + multiinterpreter is not yet fully supported. You need to provide "
            "a custom load_fn that can import your class and instantiate it."
        )

    # Create a custom load_fn that instantiates the model.
    def load_fn(*args, **kwargs):
        return model_class()

    kwargs["device_type"] = device_type
    kwargs["multiinterpreter"] = multiinterpreter
    return benchmarking.benchmark(
        load_fn,
        model_class.__name__,
        inputs,
        *args,
        **kwargs,
    )
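A minimal usage sketch for the CPU path above (not part of the repository); the calling convention mirrors the package's own test suite, and the durations and model counts are illustrative:

    import numpy as np
    import neuronperf
    import neuronperf.cpu

    # Benchmark a plain Python callable on CPU. The class itself (not an
    # instance) is passed so each worker can construct its own copy via load_fn.
    reports = neuronperf.cpu.benchmark(
        neuronperf.DummyModel,
        inputs=[np.array([1, 2, 3, 4])],
        duration=2,       # short duration; the library warns this may be unreliable
        n_models=1,
    )
    neuronperf.print_reports(reports)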
""" import logging FORMAT_STRING = '%(levelname)s:%(name)s - %(message)s' def _get_stream_handlers(level = logging.DEBUG): formatter = logging.Formatter(FORMAT_STRING) sh = logging.StreamHandler() sh.setLevel(logging.DEBUG) sh.setFormatter(formatter) return [sh] ================================================ FILE: src/neuronperf/src/neuronperf/model_index.py ================================================ # -*- coding: utf-8 -*- """ neuronperf.model_index ~~~~~~~~~~~~~~~~~~~~~~~ Provides utilities for working with model indexes. """ from typing import Any, List, Union import builtins import copy as copy_module import itertools import json import logging import os import pathlib import random import shutil from .__version__ import __version__ from .compile_constants import FAST_MATH_OPTIONS log = logging.getLogger(__name__) MODEL_INDEX_SUFFIX = ".json" def generate_id(length: int = 8): """Generate a random-enough sequence to append to model names and prevent collisions.""" id_chars = "abcdefghijklmnopqrstuvwxyz0123456789" new_id = [id_chars[random.randrange(len(id_chars))] for _ in range(length)] return "".join(new_id) def generate_name(model_name: str): """Generate a model index name from a model name.""" return model_name + "_" + generate_id() + MODEL_INDEX_SUFFIX def _create(model_name: str, compile_info: list) -> dict: if not isinstance(compile_info, list): log.exception( "Expected a list of compile info dicts, received '{}'.".format(str(type(compile_info))) ) model_index = { "NeuronPerf_version": __version__, "model_name": model_name, "model_configs": compile_info, } return model_index def create( filename: str, model_name: str = None, batch_size: int = 1, pipeline_size: int = 1, performance_level: int = max(FAST_MATH_OPTIONS), compile_s: float = None, status: str = "finished", ) -> dict: r""" Create a new model index from a pre-compiled model. :param str filename: The path to the compiled model. :param str model_name: A friendly name for the model. Will default to filename. :param int batch_size: The batch size at compilation for this model. :param int pipeline_size: The pipeline size used at compilation for this model. :param int performance_level: The performance level this model was compiled with. :param float compile_s: Seconds spent compiling. :param str status: A string describing compilation result. Can be "finished" or "error". :return: A new dictionary representing a model index. :rtype: dict """ if not model_name: model_name = filename compile_info = [ { "filename": filename, "batch_size": batch_size, "pipeline_size": pipeline_size, "performance_level": performance_level, "compile_s": compile_s, "status": status, } ] return _create(model_name, compile_info) def delete(filename: str): """Deletes the model index and all associated models referenced by the index.""" if not os.path.exists(filename): log.warning("Asked to delete '{}', but it can't be located.".format(filename)) return # Load the index configs = load(filename)["model_configs"] # Remove all referenced models model_filenames = map(lambda x: x["filename"], itertools.chain(configs)) for model_filename in model_filenames: log.debug(f"Deleting '{model_filename}'.") if os.path.exists(model_filename): if os.path.isdir(model_filename): shutil.rmtree(model_filename) else: os.remove(model_filename) # Finally, remove the model index itself log.debug(f"Deleting '{filename}'") os.remove(filename) def copy(old_index: Union[str, dict], new_index: str, new_dir: str) -> str: r""" Copy an index to a new location. 
def copy(old_index: Union[str, dict], new_index: str, new_dir: str) -> str:
    r"""
    Copy an index to a new location.

    Will rename ``old_index`` to ``new_index`` and copy all model files into
    ``new_dir``, updating the index paths. This is useful for pulling individual
    models out of a pool. Returns the path to the new index.
    """
    os.makedirs(new_dir, exist_ok=True)
    index = _sanitize(old_index)[0].copy()
    configs = index["model_configs"]
    for config in configs:
        path = pathlib.Path(config["filename"])
        config["filename"] = str(shutil.copy2(path, new_dir))
    return save(index, new_index)


def move(old_index: str, new_index: str, new_dir: str) -> str:
    """This is the same as ``copy`` followed by ``delete`` on the old index."""
    index = copy(old_index, new_index, new_dir)
    delete(old_index)
    return index


def _sanitize(*model_indexes: Union[str, dict]) -> List[dict]:
    r"""
    Helper function to load indexes if strings are provided.
    If already loaded, this is a no-op.
    """
    if not model_indexes:
        raise ValueError("No model indexes were provided.")
    indexes = []
    # Load any paths provided and sanity check all inputs.
    for index in model_indexes:
        if not index:
            raise ValueError("An empty value was received, but expected a model index.")
        if isinstance(index, str):
            index = load(index)
        if not isinstance(index, dict):
            raise TypeError("Expected a model index, but received '{}'.".format(str(type(index))))
        if not len(index) > 0:
            raise ValueError("Received an empty model index.")
        indexes.append(index)
    # Check versions are all the same, and emit a warning if they aren't.
    versions = set(map(lambda x: x["NeuronPerf_version"], indexes))
    if len(versions) > 1:
        log.warning("Received model indexes with different versions: '{}'.".format(str(versions)))
    # Ensure model names are matching.
    model_name = indexes[0]["model_name"]
    if not all(model_name == index["model_name"] for index in indexes):
        model_names = list(set(map(lambda x: x["model_name"], indexes)))
        log.warning("Received model indexes with different model names: {}".format(model_names))
    return indexes


def append(*model_indexes: Union[str, dict]) -> dict:
    r"""
    Appends the model indexes non-destructively into a new model index,
    without modifying any of the internal data. This is useful if you have
    benchmarked multiple related models and wish to combine their respective
    model indexes into a single index.

    Model name will be taken from the first index provided.
    Duplicate configs will be filtered.

    :param Union[str, dict] model_indexes: Model indexes or paths to model indexes to combine.
    :return: A new dictionary representing the combined model index.
    :rtype: dict
    """
    indexes = _sanitize(*model_indexes)
    # Extract the model configs from the indexes
    config_iter = map(lambda index: copy_module.deepcopy(index["model_configs"]), indexes)
    # Combine the model configs
    combined = list(itertools.chain.from_iterable(config_iter))
    # Split unique and duplicate configs
    duplicate = []
    unique = []
    for config in combined:
        if config in unique:
            duplicate.append(config)
        else:
            unique.append(config)
    if len(duplicate) > 0:
        log.warning(
            (
                f"There were {len(duplicate)} duplicate model configs "
                "filtered. The duplicates were:\n"
                "{}".format("\n".join(map(lambda c: str(c), duplicate)))
            )
        )
    # Build new index from configs
    return _create(indexes[0]["model_name"], unique)


def save(model_index: dict, filename: str = None, root_dir=None) -> str:
    r"""Save a NeuronPerf model index to a file."""
    if not filename:
        model_name = model_index["model_name"]
        filename = generate_name(model_name)
    if not filename.lower().endswith(MODEL_INDEX_SUFFIX):
        filename += MODEL_INDEX_SUFFIX
    if not root_dir:
        root_dir = "."
    try:
        with open(os.path.join(root_dir, filename), "w") as fp:
            json.dump(model_index, fp)
    except OSError:
        log.exception("Failed to write '{}'.".format(filename))
    return filename


def load(filename) -> dict:
    """Load a NeuronPerf model index from a file."""
    model_index = None
    try:
        with open(filename, "r") as fp:
            model_index = json.load(fp)
    except OSError:
        # file is probably not a model index
        log.exception("Failed to load model index '{}'".format(filename))
    else:
        from distutils.version import LooseVersion

        try:
            if LooseVersion(model_index["NeuronPerf_version"]) > LooseVersion(__version__):
                log.warning(
                    "Model index newer than NeuronPerf (version {} > {}). Try updating NeuronPerf.".format(
                        model_index["NeuronPerf_version"], __version__
                    )
                )
        except TypeError:
            log.warning(
                "Couldn't compare model index version ({}) to NeuronPerf version ({}), continuing anyway.".format(
                    model_index["NeuronPerf_version"], __version__
                )
            )
    return model_index


def filter_configs(configs, filter_name, filter_values) -> List:
    """Filters provided configs on specified filter and value and returns a new config list."""
    if filter_values is None:
        return configs.copy()
    # Filter on configs that have the filter_name and value is in filter_values
    if not isinstance(filter_values, list):
        filter_values = [filter_values]
    return list(
        builtins.filter(
            lambda config: filter_name in config and config[filter_name] in filter_values, configs
        )
    )


def filter(index: Union[str, dict], **kwargs) -> dict:
    r"""
    Filters provided model index on provided criteria and returns a new index.

    Each kwarg is a standard (k, v) pair, where k is treated as a filter name
    and v may be one or more values used to filter model configs.
    """
    index = _sanitize(index)[0].copy()
    # Filter each config on provided kwargs pairs.
    configs = index["model_configs"]
    for k, v in kwargs.items():
        configs = filter_configs(configs, k, v)
    index["model_configs"] = configs
    return index
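A minimal usage sketch for the model index utilities above (not part of the repository); the model file paths are hypothetical, and the calls mirror the package's own tests:

    import neuronperf

    # Build two single-config indexes for hypothetical compiled artifacts.
    idx_a = neuronperf.model_index.create("models/resnet_b1.pt", model_name="resnet", batch_size=1)
    idx_b = neuronperf.model_index.create("models/resnet_b4.pt", model_name="resnet", batch_size=4)

    # Combine them into one index; duplicate configs would be filtered automatically.
    combined = neuronperf.model_index.append(idx_a, idx_b)

    # Keep only the batch size 4 entry.
    only_b4 = neuronperf.model_index.filter(combined, batch_size=4)
    assert len(only_b4["model_configs"]) == 1

    # Persist and reload the combined index.
    index_file = neuronperf.model_index.save(combined, filename="resnet_index.json")
    reloaded = neuronperf.model_index.load(index_file)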
================================================
FILE: src/neuronperf/src/neuronperf/mxnet/__init__.py
================================================
from neuronperf.mxnet.mxnet import benchmark, compile


================================================
FILE: src/neuronperf/src/neuronperf/mxnet/mxnet.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.mxnet
~~~~~~~~~~~~~~~~

Provides Apache MXNet support.
"""

import contextlib
import functools
import os
import threading

# handle different API versions of mxnet
import mxnet as mx
from distutils.version import LooseVersion

if LooseVersion(mx.__version__) >= LooseVersion("1.8"):
    _mx_version = 1.8
    import mx_neuron as neuron
else:
    _mx_version = 1.5
    from mxnet.contrib import neuron

from .. import benchmarking


class _MXNetModelWrapper:
    def __init__(self, device_id, sym, args, aux):
        self.device_id = device_id
        self.sym = sym
        self.args = args
        self.aux = aux
        self.ctx = None
        self.exes = {}
        self.lock = threading.Lock()

    def __call__(self, inputs):
        # on the first inference, do prep work
        if not self.ctx:
            self.ctx = mx.neuron(self.device_id)
            # prepare inputs for model
            for k, v in inputs.items():
                inputs[k] = mx.nd.array(v)
            self.args.update(inputs)
        # obtain an executor for this thread
        thread_id = threading.get_ident()
        if thread_id not in self.exes:
            with self.lock:
                exe = self.sym.bind(
                    ctx=self.ctx, args=self.args, aux_states=self.aux, grad_req="null"
                )
                self.exes[thread_id] = exe
        else:
            exe = self.exes[thread_id]
        # run inference
        outputs = exe.forward(**inputs)
        mx.nd.waitall()
        return outputs[0]


@contextlib.contextmanager
def change_dir(new_dir):
    old_dir = os.getcwd()
    os.chdir(os.path.join(old_dir, new_dir))
    try:
        yield
    finally:
        os.chdir(old_dir)


def _load_fn(model_filename, **kwargs):
    device_id = kwargs.get("device_id", 0)
    sym, args, aux = mx.model.load_checkpoint(model_filename, 0)
    return _MXNetModelWrapper(device_id, sym, args, aux)


def _compile_fn(model, example_inputs, models_dir, model_name, **kwargs):
    _sym, _args, _aux = model
    model_filename = os.path.join(models_dir, model_name)
    compiler_args = kwargs.pop("compiler_args", {})
    # MXNet passes additional kwargs directly to compiler
    _sym, _args, _aux = neuron.compile(
        _sym,
        _args,
        _aux,
        example_inputs,
        **compiler_args,
    )
    with change_dir(models_dir):
        mx.model.save_checkpoint(model_name, 0, _sym, _args, _aux)
    return model_filename


def compile(model, inputs, *args, **kwargs):
    return benchmarking.compile(_compile_fn, model, inputs, *args, **kwargs)


def benchmark(model_filename, inputs, *args, **kwargs):
    env_setup_fn = kwargs.pop("env_setup_fn", lambda *_: None)

    # Use a custom setup function to handle MXNet concurrency requirements.
    @functools.wraps(env_setup_fn)
    def _env_setup_fn(id, config):
        workers_per_model = str(config["workers_per_model"])
        os.environ["MXNET_CPU_TEMP_COPY"] = workers_per_model
        os.environ["MXNET_EXEC_NUM_TEMP"] = workers_per_model
        os.environ["MXNET_CPU_WORKER_NTHREADS"] = workers_per_model
        os.environ["MXNET_MP_WORKER_NTHREADS"] = workers_per_model
        # Remember to call any additional custom setup provided.
        env_setup_fn(id, config)

    kwargs["env_setup_fn"] = _env_setup_fn
    return benchmarking.benchmark(_load_fn, model_filename, inputs, *args, **kwargs)
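A minimal usage sketch for the MXNet path (not part of the repository), assuming an Inf1 environment with the Neuron MXNet plugin installed. The checkpoint prefix is hypothetical, and passing ``model_name`` through ``compile``'s kwargs is an assumption inferred from ``_compile_fn``'s parameters above, not a documented public signature:

    import mxnet as mx
    import neuronperf.mxnet

    # Hypothetical checkpoint prefix; load_checkpoint returns the (sym, args, aux)
    # tuple that neuronperf.mxnet.compile expects as its model argument.
    sym, args, aux = mx.model.load_checkpoint("my_model", 0)
    example_inputs = {"data": mx.nd.ones((1, 3, 224, 224))}

    # Compile, then benchmark the saved checkpoint. model_name is assumed to be
    # forwarded to _compile_fn by benchmarking.compile.
    model_filename = neuronperf.mxnet.compile((sym, args, aux), example_inputs, model_name="my_model")
    reports = neuronperf.mxnet.benchmark(model_filename, example_inputs)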
================================================
FILE: src/neuronperf/src/neuronperf/py.typed
================================================
# Marker file that indicates this package supports typing


================================================
FILE: src/neuronperf/src/neuronperf/reporting.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.reporting
~~~~~~~~~~~~~~~~~~~~

Provides utilities for producing reports from benchmarking results.
"""

from typing import List

import csv
import itertools
import json
import logging
import time

import numpy as np

from . import __version__

log = logging.getLogger(__name__)

CSV_COLS = [
    "model_name",
    "n_models",
    "workers_per_model",
    "pipeline_size",
    "batch_size",
    "throughput_avg",
    "throughput_peak",
    "latency_ms_p0",
    "latency_ms_p50",
    "latency_ms_p90",
    "latency_ms_p95",
    "latency_ms_p99",
    "latency_ms_p100",
    "cpu_avg_percent",
    "cpu_percent_p50",
    "mem_avg_percent",
    "mem_percent_p50",
    "e2e_avg_ms",
    "infer_avg_ms",
    "total_infs",
    "total_s",
    "performance_level",
    "model_filename",
    "device_type",
    "instance_type",
    "cost_per_1m_inf",
]

PRINT_COLS = [
    "throughput_avg",
    "latency_ms_p50",
    "latency_ms_p99",
    "n_models",
    "pipeline_size",
    "workers_per_model",
    "batch_size",
    "model_filename",
]

REQUIRED_CONFIG_KEYS = [
    "multiprocess",
    "multiinterpreter",
    "device_type",
    "batch_size",
    "model_filename",
    "model_name",
    "n_models",
    "pipeline_size",
]

REQUIRED_RESULTS_KEYS = [
    "workers_per_model",
    "status",
    "timers",
    "n_infs",
    "total_s",
]


def _validate_config(config):
    for required_key in REQUIRED_CONFIG_KEYS:
        if required_key not in config:
            raise ValueError(
                (
                    f"Model config is missing required key '{required_key}'. "
                    f"Something probably went wrong during benchmarking. Provided:\n{config}"
                )
            )


def _validate_results(results):
    for required_key in REQUIRED_RESULTS_KEYS:
        if required_key not in results:
            raise ValueError(
                (
                    f"Benchmarking results are missing required key '{required_key}'. "
                    f"Something probably went wrong during benchmarking. Provided:\n{results}"
                )
            )


def _get_report_name(model_name: str) -> str:
    return "{}.results-{}".format(model_name, time.strftime("%Y%m%d-%H%M%S"))
""" report = {} config, results = benchmark_results _validate_config(config) _validate_results(results) try: report["NeuronPerf_version"] = __version__ # copy benchmarker info from config into report for k, v in config.items(): report[k] = v # number of intervals is the same across all stats, so we can use this as a proxy report["n_stats_intervals"] = len(results["cpu_percents"]) report["workers_per_model"] = results["workers_per_model"] report["status"] = results["status"] # timing stats report["load_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["load"]), float ).mean() report["input_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["input"]), float ).mean() report["warmup_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["warmup"]), float ).mean() report["env_setup_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["env_setup"]), float ).mean() report["setup_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["setup"]), float ).mean() report["preprocess_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["preprocess"]), float ).mean() report["infer_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["infer"]), float ).mean() report["postprocess_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["postprocess"]), float ).mean() report["e2e_avg_ms"] = np.fromiter( (t.avg("ms") for t in results["timers"]["e2e"]), float ).mean() report["worker_avg_s"] = round( np.fromiter((t.avg("s") for t in results["timers"]["worker"]), float).mean(), 2 ) report["total_infs"] = results["n_infs"] * config["batch_size"] report["total_s"] = round(results["total_s"], 2) percentiles = [0, 50, 90, 95, 99, 100] cpu_percents = np.fromiter(results["cpu_percents"], float) if cpu_percents.size > 2: cpu_percentiles = np.percentile(cpu_percents[1:-1], percentiles) report["cpu_avg_percent"] = cpu_percentiles.mean() for i, p in enumerate(percentiles): report[f"cpu_percent_p{p}"] = cpu_percentiles[i] mem_percents = np.fromiter(results["mem_percents"], float) if mem_percents.size > 2: mem_percentiles = np.percentile(mem_percents[1:-1], percentiles) report["mem_avg_percent"] = mem_percentiles.mean() for i, p in enumerate(percentiles): report[f"mem_percent_p{p}"] = mem_percentiles[i] # latency latencies = np.fromiter( itertools.chain.from_iterable(t.durations("ms") for t in results["timers"]["e2e"]), float, ) latency_percentiles = np.percentile(latencies, percentiles) for i, p in enumerate(percentiles): report["latency_ms_p{}".format(p)] = latency_percentiles[i] # bucketize ending timestamps end_timestamps = np.fromiter( itertools.chain.from_iterable(t.end_timestamps("s") for t in results["timers"]["e2e"]), float, ) bucket_ends = np.floor(end_timestamps / window_size) # group timestamps by window and correct for batch size _, bucket_counts = np.unique(bucket_ends, return_counts=True) bucket_counts *= config["batch_size"] # find max and normalize by window size report["throughput_peak"] = bucket_counts.max() / window_size report["throughput_avg"] = bucket_counts[1:-1].mean() / window_size if verbosity > 0: report["throughput_hist"] = bucket_counts if verbosity > 1: report["e2e_durations_ms"] = np.fromiter( (t.durations("ms") for t in results["timers"]["e2e"]), float ) # Try to estimte cost / inference if cost_per_hour: try: infs_per_hour = 3600 * report["throughput_avg"] report["cost_per_1m_inf"] = cost_per_hour * (1_000_000 / infs_per_hour) except: # We'll ignore this, as it's caused by a missing field that would have # 
                # already generated an earlier error log. We should continue
                # producing a report nonetheless.
                pass

        # Truncate floats to 3 places for readability.
        for key, value in report.items():
            if isinstance(value, float):
                report[key] = round(value, 3)
    except Exception:
        log.exception(
            (
                "Failed to produce a report from benchmarking results. "
                "Something probably went wrong during benchmarking."
            )
        )
    return report


def get_reports(results, cost_per_hour: float = None) -> List[dict]:
    r"""
    Summarizes and combines the detailed results from ``neuronperf.benchmark``,
    when run with ``return_timers=True``. One report dictionary is produced per
    model configuration benchmarked. The list of reports can be fed directly to
    other reporting utilities, such as ``neuronperf.write_csv``.

    :param results: Benchmarker results.
    :param float cost_per_hour: The cost / hour for this device.
    """
    reports = []
    for idx, (config, result) in enumerate(results):
        try:
            _validate_config(config)
            _validate_results(result)
        except ValueError:
            log.exception(f"Result {idx} is missing required information, skipping.")
            continue
        report = get_report((config, result), cost_per_hour)
        reports.append(report)
    return reports


def print_reports(reports: List[dict], cols=PRINT_COLS, sort_by="throughput_peak", reverse=False):
    r"""Print a subset of report cols to the terminal.

    :param reports: Results from ``get_reports``.
    :param cols: The columns in the report to be displayed.
    :param sort_by: Sort the cols by the specified key.
    :param reverse: Sort order.
    """
    if not reports:
        print("No reports were found. Did benchmarking succeed?")
        return
    # Print headers.
    col_width = max(map(lambda col: len(col), cols)) + 1
    row_format = "{{:<{}}}".format(col_width) * len(cols)
    print(row_format.format(*cols))
    # Extract all rows.
    rows = []
    for report in reports:
        row = []
        for col in cols:
            row.append(report[col] if col in report else "N/A")
        rows.append(row)
    # Sort rows by the specified key, if the key exists.
    if sort_by in cols:
        sort_index = cols.index(sort_by)
        rows = sorted(rows, key=lambda row: row[sort_index], reverse=reverse)
    # Print all rows.
    for row in rows:
        print(row_format.format(*row))


def write_csv(reports: List[dict], filename: str = None, cols=CSV_COLS):
    r"""Write a benchmarking report to a CSV file.

    :param reports: Results from ``get_reports``.
    :param filename: File name to write out. If not provided, generated from
        the model_name in the report and the current timestamp.
    :param cols: The columns in the report to be kept.
    """
    if not filename:
        filename = "{}.csv".format(_get_report_name(reports[0]["model_name"]))
    try:
        with open(filename, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(cols)
            for idx, report in enumerate(reports):
                row = []
                for col in cols:
                    if col in report:
                        row.append(report[col] if report[col] is not None else "N/A")
                    else:
                        log.debug(f"Report {idx} is missing field '{col}'.")
                        row.append("N/A")
                writer.writerow(row)
        return filename
    except OSError:
        log.exception(f"Failed to write '{filename}'. Check that you have write permissions.")


def write_json(reports: List[dict], filename: str = None):
    if not filename:
        filename = "{}.json".format(_get_report_name(reports[0]["model_name"]))
    try:
        with open(filename, "w", encoding="utf-8") as jsonfile:
            json.dump(reports, jsonfile)
        return filename
    except OSError:
        log.exception(
            (
                f"Failed to write '{filename}'. Check that the report "
                "contains data and that you have write permissions."
            )
        )
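A minimal usage sketch for the reporting pipeline above (not part of the repository), mirroring the calls made in the package's test suite:

    import numpy as np
    import neuronperf
    import neuronperf.cpu

    # Collect raw (config, results) tuples by benchmarking with return_timers=True.
    benchmarker_results = neuronperf.cpu.benchmark(
        neuronperf.DummyModel,
        inputs=[np.array([1, 2, 3, 4])],
        duration=2,
        return_timers=True,
    )

    # Summarize into one report dict per configuration, then display and persist.
    reports = neuronperf.get_reports(benchmarker_results)
    neuronperf.print_reports(reports)
    csv_file = neuronperf.write_csv(reports)
    json_file = neuronperf.write_json(reports)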
================================================
FILE: src/neuronperf/src/neuronperf/scripts/__init__.py
================================================


================================================
FILE: src/neuronperf/src/neuronperf/scripts/run_benchmark_file.py
================================================
import argparse

import dill

import neuronperf


def main():
    parser = argparse.ArgumentParser(
        prog="benchmark",
        description="Run a serialized Benchmarker for a given `duration`. Upon "
        "success, overwrite `filename` with the updated Benchmarker.",
    )
    parser.add_argument("filename", type=str, help="The serialized Benchmarker")
    parser.add_argument("duration", type=float, help="The duration of each config (seconds)")
    parser.add_argument("--model_class_name", type=str, help="The name of a model class to load")
    parser.add_argument(
        "--model_class_file", type=str, help="Path to Python module defining model_class_name"
    )
    args = parser.parse_args()

    try:
        # If we were provided with a model class to import before deserialization,
        # we need to handle that now. The class will be manually imported.
        if args.model_class_name and args.model_class_file:
            import importlib.util

            spec = importlib.util.spec_from_file_location(
                args.model_class_name, args.model_class_file
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            globals()[args.model_class_name] = getattr(module, args.model_class_name)

        # Load the benchmarker object
        with open(args.filename, "rb") as f:
            benchmarker = dill.load(f)

        # Execute the benchmarker
        result = neuronperf.benchmarking.run_benchmarker(benchmarker, args.duration)

        # Write the result back to the same file
        with open(args.filename, "wb") as f:
            dill.dump(result, f)
    except Exception:
        # Dump a traceback to a file for debugging.
        import os
        import sys
        import traceback
        from pathlib import Path

        path = Path(args.filename)
        filename = os.path.join(path.parent, "neuronperf_error_{}".format(path.stem))
        trace = "".join(traceback.format_exception(*sys.exc_info()))
        with open(filename, "wt") as err_fp:
            err_fp.write(trace)


if __name__ == "__main__":
    main()


================================================
FILE: src/neuronperf/src/neuronperf/tensorflow/__init__.py
================================================
from neuronperf.tensorflow.tensorflow import benchmark, compile


================================================
FILE: src/neuronperf/src/neuronperf/tensorflow/tensorflow.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.tensorflow
~~~~~~~~~~~~~~~~~~~~~

Provides TensorFlow support.
"""

import itertools
import logging
import os
import threading

from .. import benchmarking

log = logging.getLogger(__name__)

_lock = threading.Lock()


def _load_fn(model_file, **kwargs):
    with _lock:
        import tensorflow as tf

        if tf.__version__.startswith("1"):
            return tf.contrib.predictor.from_saved_model(model_file)
        else:
            import tensorflow.keras as keras

            return keras.models.load_model(model_file)


def _compile_fn(model, inputs, models_dir, model_name, **kwargs):
    import tensorflow as tf
    import tensorflow.neuron as tfn

    model_filename = os.path.join(models_dir, model_name)
    # NeuronPerf provides compiler_args as a dictionary, but the framework expects a different format.
    compiler_args = kwargs.pop("compiler_args", {})
    if tf.__version__.startswith("1"):
        compiler_args_flattened = list(itertools.chain.from_iterable(compiler_args.items()))
        kwargs["compiler_args"] = compiler_args_flattened
        kwargs["model_feed_dict"] = inputs
        # For TF 1.x, the saved model path is expected instead of a loaded model.
        tfn.saved_model.compile(model, model_filename, **kwargs)
    else:
        if compiler_args:
            compiler_args_flattened = " ".join(
                ["{}={}".format(k, v) for k, v in compiler_args.items()]
            )
            os.environ["NEURON_CC_FLAGS"] = compiler_args_flattened
        else:
            os.environ["NEURON_CC_FLAGS"] = ""
        model_neuron = tfn.trace(model, inputs, **kwargs)
        model_neuron.save(model_filename)
    return model_filename


def compile(model, inputs, *args, **kwargs):
    return benchmarking.compile(_compile_fn, model, inputs, *args, **kwargs)


def benchmark(model_filename, inputs, *args, **kwargs):
    # tensorflow-neuron is not currently fork safe, so we work around this during
    # benchmarking by spawning a fresh interpreter session for each model we benchmark.
    if "multiinterpreter" in kwargs and not kwargs["multiinterpreter"]:
        log.warning(
            "Setting multiinterpreter=False is not safe with TensorFlow. Use at your own risk."
        )
    else:
        kwargs["multiinterpreter"] = True
    return benchmarking.benchmark(_load_fn, model_filename, inputs, *args, **kwargs)
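A minimal usage sketch for the TensorFlow path above (not part of the repository), assuming an Inf1 environment with tensorflow-neuron installed; the model and input shapes are illustrative, and wrapping the example input in a list follows the ``benchmark()`` convention of one input per batch size:

    import tensorflow as tf
    import neuronperf
    import neuronperf.tensorflow

    # An illustrative Keras model; any model supported by tensorflow-neuron tracing works.
    model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(8,))])
    example = tf.zeros([1, 8])

    # Compile for Neuron, then benchmark the saved model. Note that
    # neuronperf.tensorflow.benchmark forces multiinterpreter=True for fork safety.
    model_filename = neuronperf.tensorflow.compile(model, [example])
    reports = neuronperf.tensorflow.benchmark(model_filename, [example])
    neuronperf.print_reports(reports)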
================================================
FILE: src/neuronperf/src/neuronperf/timing.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.timing
~~~~~~~~~~~~~~~~~

Provides utility functions for timing and time unit conversions.
"""

from typing import Any, Callable

import sys
import time
import typing

import numpy as np

time_unit_ratios = {
    "ns": {"ns": 1, "us": 1e-3, "ms": 1e-6, "s": 1e-9},
    "us": {"ns": 1e3, "us": 1, "ms": 1e-3, "s": 1e-6},
    "ms": {"ns": 1e6, "us": 1e3, "ms": 1, "s": 1e-3},
    "s": {"ns": 1e9, "us": 1e6, "ms": 1e3, "s": 1},
}
supported_time_units = time_unit_ratios.keys()


def timestamp_convert(timestamps, input_time_unit: str, output_time_unit: str):
    """Convert timestamp(s) from one time unit to another.

    :param timestamps: A timestamp or iterable of timestamps.
    :param input_time_unit: A string specifying the input time unit.
    :param output_time_unit: A string specifying the output time unit.
    :returns: A single timestamp or container of timestamps in the output time unit.
    """
    try:
        ratio = time_unit_ratios[input_time_unit][output_time_unit]
    except KeyError:
        raise ValueError(f"Can't convert {input_time_unit} to {output_time_unit}")
    return timestamps * ratio


class Timer:
    def __init__(self, timer_fn: Callable[[], Any] = time.perf_counter, timer_unit: str = "s"):
        self.timer_fn = timer_fn
        self.timer_unit = timer_unit
        self._start = []
        self._end = []

    def __enter__(self):
        self.start()

    def __exit__(self, type, value, traceback):
        self.stop()

    def __delitem__(self, index):
        del self._start[index]
        del self._end[index]

    def __getitem__(self, index):
        # It's possible that start and end won't match if negative indices are used,
        # because the timer may have started and not stopped yet.
        if index < 0:
            index = index % len(self._end)
        return self._start[index], self._end[index]

    def __iter__(self):
        return zip(self._start, self._end)

    def __len__(self):
        return len(self._end)

    def __str__(self):
        return str(self.timestamps())

    def start(self):
        # If we've already started, consider this a request to restart.
        # This also handles partial timestamps due to a Timer-unrelated error.
        if len(self._start) > len(self._end):
            self._start.pop()
        self._start.append(self.timer_fn())

    def stop(self):
        # if we haven't started, ignore this
        if 0 == len(self._start):
            return
        self._end.append(self.timer_fn())

    def next(self):
        """Manually advance the timer to the next timestamp measurement."""
        self.stop()
        self.start()

    def reset(self):
        self._start.clear()
        self._end.clear()

    def insert(self, timestamps: tuple, time_unit: str):
        """Manually insert a timestamp pair. Does not affect ongoing timing.

        :param timestamps: Timestamp pair to insert.
        :param time_unit: The time unit of the incoming timestamps.
        """
        if len(timestamps) != 2 or not time_unit:
            raise ValueError()
        timestamps = timestamp_convert(np.array(timestamps), time_unit, self.timer_unit)
        self._start.insert(0, timestamps[0])
        self._end.insert(0, timestamps[1])

    def start_timestamps(self, time_unit: str = None):
        if not time_unit:
            return np.array(self._start)
        return timestamp_convert(np.array(self._start), self.timer_unit, time_unit)

    def end_timestamps(self, time_unit: str = None):
        if not time_unit:
            return np.array(self._end)
        return timestamp_convert(np.array(self._end), self.timer_unit, time_unit)

    def timestamps(self, time_unit: str = None):
        """Returns a list of pairs of timestamps (start, end).

        :param time_unit: The time unit of the output timestamp(s). `None` will use the timer's native unit.
        """
        starts, ends = self.start_timestamps(time_unit), self.end_timestamps(time_unit)
        return np.stack((starts[: len(ends)], ends), axis=-1)

    def durations(self, time_unit: str = None):
        """Returns an `ndarray` of timestamp deltas, optionally converted into a provided time unit.

        :param time_unit: The time unit of the output timestamp(s). `None` will use the timer's native unit.
        :returns: An `ndarray` of timestamp deltas.
        """
        starts, ends = self.start_timestamps(), self.end_timestamps()
        deltas = ends - starts[: len(ends)]
        # If no unit is requested, return deltas in the timer's native unit.
        return deltas if not time_unit else timestamp_convert(deltas, self.timer_unit, time_unit)

    def total_duration(self, time_unit: str = None):
        """Returns the total duration of all time measurements, optionally converted into a provided time unit.

        :param time_unit: The time unit of the output timestamp(s). `None` will use the timer's native unit.
        """
        starts, ends = self.start_timestamps(), self.end_timestamps()
        total = np.sum(ends - starts[: len(ends)])
        return total if not time_unit else timestamp_convert(total, self.timer_unit, time_unit)

    def avg(self, time_unit: str = None):
        """Returns the average duration, optionally converted into a provided time unit.

        :param time_unit: The time unit of the output timestamp(s). `None` will use the timer's native unit.
        :returns: The average duration.
        """
        return self.durations(time_unit).mean() if len(self._end) > 0 else 0
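A minimal usage sketch for the Timer above (not part of the repository), following the patterns exercised in the package's test suite:

    import time
    import neuronperf

    timer = neuronperf.Timer()

    # Time a few operations; each with-block records one (start, end) pair.
    for _ in range(3):
        with timer:
            time.sleep(0.01)

    print(len(timer))                 # number of completed measurements
    print(timer.durations("ms"))      # per-measurement deltas in milliseconds
    print(timer.total_duration("s"))  # total measured time in seconds
    print(timer.avg("ms"))            # average duration in milliseconds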
================================================
FILE: src/neuronperf/src/neuronperf/torch/__init__.py
================================================
from neuronperf.torch.torch import benchmark, compile


================================================
FILE: src/neuronperf/src/neuronperf/torch/torch.py
================================================
# -*- coding: utf-8 -*-

"""
neuronperf.torch
~~~~~~~~~~~~~~~~

Provides PyTorch support.
"""

import functools
import itertools
import logging
import math
import os
import types

import torch

from .. import benchmarking

log = logging.getLogger(__name__)


def _compile_fn(model, example_inputs, models_dir, model_name, **kwargs):
    """Compiles a model for Neuron."""
    import torch_neuron

    model_filename = os.path.join(models_dir, "{}.pt".format(model_name))
    model.eval()
    # NeuronPerf provides compiler_args as a dictionary, but the framework expects a different format.
    compiler_args = kwargs.get("compiler_args", {})
    compiler_args_flattened = list(itertools.chain.from_iterable(compiler_args.items()))
    kwargs["compiler_args"] = compiler_args_flattened
    model_neuron = torch.neuron.trace(
        model,
        example_inputs,
        **kwargs,
    )
    model_neuron.save(model_filename)
    return model_filename


def _load_fn(model_filename, **kwargs):
    import torch_neuron

    model = torch.jit.load(model_filename)
    model.eval()
    return model


def _class_load_fn(model_class, **kwargs):
    model = model_class()
    model.eval()
    return model


def compile(model, inputs, *args, **kwargs):
    return benchmarking.compile(_compile_fn, model, inputs, *args, **kwargs)


# See: https://pytorch.org/docs/stable/data.html#dataset-types
def _get_dataset_loader_fn(dataset, loop):
    def _worker_init_fn(worker_id):
        # This function will be called for each worker by torch.
        worker_info = torch.utils.data.get_worker_info()
        worker_id = worker_info.id
        num_workers = worker_info.num_workers
        dataset = worker_info.dataset  # the dataset copy in this worker process
        per_worker = int(math.ceil(len(dataset) / float(num_workers)))
        start = worker_id * per_worker
        end = min(start + per_worker, len(dataset))
        log.debug(
            "worker_id={}, num_workers={}, per_worker={}, start={}, end={}".format(
                worker_id, num_workers, per_worker, start, end
            )
        )

        # We monkey-patch the dataset __iter__ function to support a multi-worker config.
        def _iter(self, start, end, loop):
            if loop:
                return itertools.cycle(range(start, end))
            else:
                return iter(range(start, end))

        # Bind start/end/loop as keywords so that MethodType can still pass the
        # dataset through as `self`.
        __iter__ = functools.partial(_iter, start=start, end=end, loop=loop)
        dataset.__iter__ = types.MethodType(__iter__, dataset)

    def dataset_loader_fn(dataset, num_workers):
        return iter(
            torch.utils.data.DataLoader(
                dataset, num_workers=num_workers, worker_init_fn=_worker_init_fn
            )
        )

    return dataset_loader_fn


def benchmark(model_filename, inputs, *args, dataset_inputs=False, loop_dataset=False, **kwargs):
    # These functions may need to be overridden or wrapped, depending upon the config requested.
    load_fn = _load_fn
    setup_fn = kwargs.get("setup_fn", lambda *args, **kwargs: None)
    preprocess_fn = kwargs.get("preprocess_fn", lambda *args: (*args,))

    # If CUDA is requested, ensure it's available and provide smart wrappers for CUDA device loading.
    device_type = kwargs.get("device_type", None)
    use_cuda = device_type and ("cuda" in device_type.lower() or "gpu" == device_type.lower())
    if use_cuda:
        if not torch.cuda.is_available():
            raise ValueError(
                "You requested CUDA benchmarking, but torch is unable to locate a CUDA device."
            )

        # Must use multiinterpreter for CUDA.
        if "multiinterpreter" in kwargs and not kwargs["multiinterpreter"]:
            log.warning(
                (
                    "You set multiinterpreter to False, but it is required for safe CUDA benchmarking.\n"
                    "Your preference has been overridden so that benchmarking may continue."
                )
            )
        kwargs["multiinterpreter"] = True

        # If we received a non-string, use the class-based load function.
        if not isinstance(model_filename, str):
            # In GPU benchmarking, a model class is expected. This line is for clarity.
            model_class = model_filename
            if not isinstance(model_class, type):
                raise TypeError(
                    "GPU benchmarking expects a model class to be provided instead of a filename."
                )

            # We must also know the name of the file to import from, so that serialization can succeed.
            import inspect

            try:
                model_class_file = inspect.getfile(model_class)
                kwargs["model_class_file"] = model_class_file
                kwargs["model_class_name"] = model_class.__name__
            except Exception:
                raise ValueError(
                    (
                        "Your model class must be defined in a Python module so that it can be serialized properly.\n"
                        "Please add your model to a simple Python file along with any required imports."
                    )
                )

            @functools.wraps(_class_load_fn)
            def load_fn(*args, **kwargs):
                return _class_load_fn(model_class, **kwargs)

            # Now swap the class object for its name so the benchmarker still receives a string.
            model_filename = model_class.__name__

        # Wrap setup_fn so that it moves the model to the CUDA device.
        @functools.wraps(setup_fn)
        def _setup_fn(id, config, model):
            setup_fn(id, config, model)
            model.to("cuda")

        kwargs["setup_fn"] = _setup_fn

        # Wrap preprocess_fn with one that moves inputs to CUDA.
        @functools.wraps(preprocess_fn)
        def _preprocess_fn(*inputs):
            inputs = preprocess_fn(*inputs)
            # Tensor.to() is not in-place, so keep the returned tensors.
            return tuple(input.to("cuda") for input in inputs)

        kwargs["preprocess_fn"] = _preprocess_fn

    # When custom datasets are used, a loader function will need to be available in subprocesses.
    dataset_loader_fn = None
    if dataset_inputs:
        dataset_loader_fn = _get_dataset_loader_fn(inputs, loop_dataset)
    kwargs["dataset_loader_fn"] = dataset_loader_fn

    with torch.no_grad():
        return benchmarking.benchmark(
            load_fn,
            model_filename,
            inputs,
            *args,
            **kwargs,
        )
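A minimal usage sketch for the PyTorch path above (not part of the repository), assuming an Inf1 environment with torch-neuron installed; the module and shapes are illustrative, and wrapping the example input in a list follows the ``benchmark()`` convention of one input per batch size:

    import torch
    import torch.nn as nn
    import neuronperf
    import neuronperf.torch

    # A small illustrative module; any torch.neuron-traceable nn.Module works here.
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
    example = torch.zeros(1, 8)

    # Trace/compile for Neuron, then benchmark the saved artifact.
    model_filename = neuronperf.torch.compile(model, [example])
    reports = neuronperf.torch.benchmark(model_filename, [example])
    neuronperf.print_reports(reports)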
filename = "dummy_index.json" if os.path.exists(filename): neuronperf.model_index.delete(filename) model_name = "Dummy" model_filename = os.path.join("models", "dummy.model") model_index = neuronperf.model_index.create(model_filename, model_name=model_name) neuronperf.model_index.save(model_index, filename=filename) assert os.path.exists(filename) model_index_loaded = neuronperf.model_index.load(filename) assert model_index_loaded == model_index assert model_index_loaded["model_name"] == model_name assert model_index_loaded["model_configs"][0]["batch_size"] == 1 neuronperf.model_index.delete(filename) assert not os.path.exists(filename) @pytest.mark.sanity def test_model_index_copy(): filename = "dummy_index.json" if os.path.exists(filename): neuronperf.model_index.delete(filename) model_filename = os.path.join("models", "dummy.model") os.makedirs("models", exist_ok=True) pathlib.Path(model_filename).touch() model_name = "Dummy" model_index = neuronperf.model_index.create(model_filename, model_name=model_name) neuronperf.model_index.save(model_index, filename=filename) # Test copy API using a pre-loaded model inndex neuronperf.model_index.copy(model_index, "new_index.json", "new_models") assert os.path.exists("models") assert os.path.exists(model_filename) assert os.path.exists("new_index.json") assert os.path.exists(os.path.join("new_models", "dummy.model")) new_index = neuronperf.model_index.load("new_index.json") assert new_index["model_configs"][0]["filename"] == os.path.join("new_models", "dummy.model") neuronperf.model_index.delete(filename) neuronperf.model_index.delete("new_index.json") shutil.rmtree("new_models") shutil.rmtree("models") @pytest.mark.sanity def test_model_index_copy_2(): filename = "dummy_index.json" if os.path.exists(filename): neuronperf.model_index.delete(filename) model_filename = os.path.join("models", "dummy.model") os.makedirs("models", exist_ok=True) pathlib.Path(model_filename).touch() model_name = "Dummy" model_index = neuronperf.model_index.create(model_filename, model_name=model_name) neuronperf.model_index.save(model_index, filename=filename) # Test copy API using a file neuronperf.model_index.copy(filename, "new_index.json", "new_models") assert os.path.exists("models") assert os.path.exists(model_filename) assert os.path.exists("new_index.json") assert os.path.exists(os.path.join("new_models", "dummy.model")) new_index = neuronperf.model_index.load("new_index.json") assert new_index["model_configs"][0]["filename"] == os.path.join("new_models", "dummy.model") neuronperf.model_index.delete(filename) neuronperf.model_index.delete("new_index.json") shutil.rmtree("new_models") shutil.rmtree("models") @pytest.mark.sanity def test_model_index_move(): filename = "dummy_index.json" if os.path.exists(filename): neuronperf.model_index.delete(filename) model_filename = os.path.join("models", "dummy.model") os.makedirs("models", exist_ok=True) pathlib.Path(model_filename).touch() model_name = "Dummy" model_index = neuronperf.model_index.create(model_filename, model_name=model_name) neuronperf.model_index.save(model_index, filename=filename) neuronperf.model_index.move(filename, "new_index.json", "new_models") assert not os.path.exists(filename) assert not os.path.exists(model_filename) assert os.path.exists("new_index.json") assert os.path.exists(os.path.join("new_models", "dummy.model")) new_index = neuronperf.model_index.load("new_index.json") assert new_index["model_configs"][0]["filename"] == os.path.join("new_models", "dummy.model") 
neuronperf.model_index.delete("new_index.json") shutil.rmtree("new_models") shutil.rmtree("models") @pytest.mark.sanity def test_model_index_append(): model_indexes = [ neuronperf.model_index.create(f"Dummy_{x}", model_name="Dummy") for x in range(10) ] combined_index = neuronperf.model_index.append(*model_indexes) # Assert that combination apparently did happen. assert len(combined_index["model_configs"]) == len(model_indexes) # Check that batch_sizes haven't been modified. assert all(1 == config["batch_size"] for config in combined_index["model_configs"]) # Test for duplicate filtering behavior model_indexes = [neuronperf.model_index.create("Dummy") for _ in range(10)] combined_index = neuronperf.model_index.append(*model_indexes) assert len(combined_index["model_configs"]) == 1 @pytest.mark.sanity def test_model_index_filter(): idx_1 = neuronperf.model_index.create("fake", performance_level=2, compile_s=1) idx_2 = neuronperf.model_index.create("fake2", compile_s=2) idx = neuronperf.model_index.append(idx_1, idx_2) filtered = neuronperf.model_index.filter(idx, filename="fake") print(filtered) assert 1 == len(filtered["model_configs"]) assert "fake" == filtered["model_name"] filtered = neuronperf.model_index.filter(idx, performance_level=2) assert 1 == len(filtered["model_configs"]) assert "fake" == filtered["model_name"] # None key should filter nothing filtered = neuronperf.model_index.filter(idx, compile_s=None) assert 2 == len(filtered["model_configs"]) @pytest.mark.sanity @pytest.mark.slow def test_benchmarker(): dummy_model = lambda x: None dummy_load = lambda path, device_id: dummy_model b = neuronperf.benchmarking.Benchmarker( id=0, device_id=0, load_fn=dummy_load, model_filename="test", inputs=[], workers_per_model=2 ) b.start() time.sleep(1.5) b.stop() assert b.status == "finished" assert all(n_infs > 100 for n_infs in b.n_infs) @pytest.mark.slow def test_benchmark_multithread(): benchmarker_results = neuronperf.cpu.benchmark( neuronperf.DummyModel, [np.array([1, 2, 3, 4])], duration=2, n_models=4, multiprocess=False, multiinterpreter=False, verbosity=2, return_timers=True, ) # Return value is a list of tuples: # [(config, results), (config, results), ...] # Each config is a dict. Each result is a dict. 
    # A single configuration without workers_per_model set will produce 2 results
    assert len(benchmarker_results) == 2
    for benchmarker_result in benchmarker_results:
        config, results = benchmarker_result
        assert "cpu_percents" in results
        assert "mem_percents" in results
        assert not config["multiprocess"]
        assert not config["multiinterpreter"]
        assert results["status"] == "finished"
        assert results["n_infs"] > 100


@pytest.mark.slow
def test_benchmark_multithread_2():
    dummy_model = lambda x: None
    dummy_load = lambda path, device_id: dummy_model
    reports = neuronperf.benchmark(
        load_fn=dummy_load,
        model_filename="dummy_filename",
        inputs=[[1]],
        duration=2,
        n_models=4,
        multiprocess=False,
        multiinterpreter=False,
        verbosity=2,
    )
    # A single configuration without workers_per_model set will produce 2 results
    assert len(reports) == 2
    report = reports[0]
    assert not report["multiprocess"]
    assert not report["multiinterpreter"]
    assert report["status"] == "finished"
    assert report["total_infs"] > 100


@pytest.mark.slow
def test_benchmark_multiprocess():
    n_models = 16
    benchmarker_results = neuronperf.cpu.benchmark(
        neuronperf.DummyModel,
        inputs=[np.array([1, 2])],
        batch_sizes=[1],
        duration=2,
        n_models=n_models,
        multiprocess=True,
        multiinterpreter=False,
        verbosity=2,
        return_timers=True,
    )
    # A single configuration without workers_per_model set will produce 2 results
    assert len(benchmarker_results) == 2
    # Extract the benchmarker results
    config, results = benchmarker_results[0]
    # Confirm that there is at least 1 timer / model for each benchmarker
    assert len(next(iter(results["timers"].values()))) >= n_models
    assert config["multiprocess"]
    assert not config["multiinterpreter"]
    assert results["status"] == "finished"
    assert results["n_infs"] > 100


@pytest.mark.slow
def test_benchmark_multiinterpreter():
    benchmarker_results = neuronperf.cpu.benchmark(
        neuronperf.DummyModel,
        inputs=[np.array([1, 2])],
        duration=2.5,
        n_models=2,
        multiprocess=False,
        multiinterpreter=True,
        verbosity=2,
        return_timers=True,
    )
    # A single configuration without workers_per_model set will produce 2 results
    assert len(benchmarker_results) == 2
    # Extract the benchmarker results
    config, results = benchmarker_results[0]
    assert config["multiinterpreter"]
    assert results["status"] == "finished"
    assert results["n_infs"] > 100


@pytest.mark.slow
def test_reporting():
    benchmarker_results = neuronperf.cpu.benchmark(
        neuronperf.DummyModel,
        inputs=[np.array([1, 2, 3, 4])],
        n_models=[1, 4],
        duration=2,
        verbosity=2,
        return_timers=True,
    )
    assert len(benchmarker_results) == 4
    reports = neuronperf.get_reports(benchmarker_results)
    assert len(reports) == len(benchmarker_results)
    assert all("total_infs" in report for report in reports)
    neuronperf.print_reports(reports)
    csv_file = neuronperf.write_csv(reports)
    os.remove(csv_file)
    json_file = neuronperf.write_json(reports)
    with open(json_file, "rt") as fp:
        json.load(fp)
    os.remove(json_file)

================================================ FILE: static/google673a8c4fbaa024d8.html ================================================ google-site-verification: google673a8c4fbaa024d8.html ================================================ FILE: static/robots.txt ================================================ User-agent: * Disallow: /en/v2.24.0/ Disallow: /en/v2.23.0/ Disallow: /en/v2.22.1/ Disallow: /en/v2.22.0/ Disallow: /en/v2.21.1/ Disallow: /en/v2.21.0/ Disallow: /en/v2.20.2/ Disallow: /en/v2.20.1/ Disallow: /en/v2.20.0/ Disallow: /en/v2.19.1/ Disallow: /en/v2.19.0/ Disallow: /en/v2.18.2/ Disallow: /en/v2.18.1/ Disallow: /en/v2.18.0/ Disallow:
/en/v2.17.0/ Disallow: /en/v2.16.1/ Disallow: /en/v2.16.0/ Disallow: /en/v2.15.2/ Disallow: /en/v2.15.1/ Disallow: /en/v2.15.0/ Disallow: /en/v2.14.1/ Disallow: /en/v2.14.0/ Disallow: /en/v2.13.2/ Disallow: /en/v2.13.1/ Disallow: /en/v2.13.0/ Disallow: /en/v2.12.2/ Disallow: /en/v2.12.1/ Disallow: /en/v2.12.0/ Disallow: /en/v2.11.0/ Disallow: /en/v2.10.0/ Disallow: /en/v2.9.0/ Disallow: /en/v2.8.0/ Disallow: /en/v2.7.0/ Disallow: /en/v2.6.0/ Disallow: /en/v2.5.0/ Disallow: /en/v2.4.0/ Disallow: /en/v2.3.0/ Disallow: /en/v1.19.2/ Disallow: /en/v1.19.1/ Disallow: /en/v1.19.0/ Disallow: /en/v1.18.0/ Disallow: /en/v1.17.2/ Disallow: /en/v1.17.1/ Disallow: /en/v1.17.0/ Disallow: /en/v1.16.3/ Disallow: /en/v1.16.2/ Disallow: /en/v1.16.1/ Disallow: /en/v1.16.0/ Disallow: /en/v1.15.2/ Disallow: /en/1.15.1/ Disallow: /en/1.15.0/ Disallow: /en/1.14.2/ Disallow: /en/1.14.1/ Disallow: /en/1.14.0/ Disallow: /en/1.13.0/ Disallow: /en/1.12.2/ Disallow: /en/1.12.1/ Disallow: /en/1.12.0/ Disallow: /en/1.11.0/ Sitemap: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/sitemap1.xml ================================================ FILE: static/sitemap1.xml ================================================ https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html 2025-10-20 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/misc-customops.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/third-party-solutions.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/nki_faq.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuron-ubuntu20.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/mxnet-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax-neuronx.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/setup-troubleshooting.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/multiframework-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/setup-rocky-linux-9.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/troubleshooting.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuronx.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/index.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/index.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/ecs-flows.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/third-party-solutions.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/dlc-then-customize-devflow.html 2025-10-28 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/eks-flows.html 2025-10-09 
================================================
FILE: static/sitemap1.xml
================================================
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/misc-customops.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/third-party-solutions.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/nki_faq.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuron-ubuntu20.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/mxnet-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax-neuronx.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/setup-troubleshooting.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/multiframework-dlami.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/setup-rocky-linux-9.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/troubleshooting.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/ecs-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/third-party-solutions.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/dlc-then-customize-devflow.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/eks-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/aws-batch-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/sagemaker-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/parallelcluster-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/ec2-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/releasecontent.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/2.29.0.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-developer-guide.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/configuration-guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/rn.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/faq.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/dlc-then-ecs-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/dlc-then-customize-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/neo-then-hosting-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/dlc-then-k8s-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorial-docker-runtime1.0.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/dlc-then-ec2-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/faq-troubleshooting-releasenote.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/neuron-dra.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/ec2.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/dlc-then-eks-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/container-deployment-flows.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/ec2-then-ec2-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/k8.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/troubleshooting.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/developerflows.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/locate-neuron-dlc-image.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/faq.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/neuron-plugins.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/container-sm-hosting-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/getting-started.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/beta-participation.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/amazonq-getstarted.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/whats-new.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/troubleshooting.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/profiling-tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/monitoring-tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/sdk-policy.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/what-is-neuron.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/security.html 2026-02-13
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/tensorflow_serving_tutorial.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/how-to-convolution-in-unet.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/developer-guide.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/faq.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc/api-reference-guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc/command-line-reference.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc/developer-guide.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc/faq.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF005.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF011.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/ESPP047.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF010.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF004.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF006.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF007.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF013.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF017.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EBVF030.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF016.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EHCA005.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EARG001.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF015.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF001.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EBIR023.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EUOC002.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EXTP004.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EOOM001.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/ESPP004.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EOOM002.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF018.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF024.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF031.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF019.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF022.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EVRF009.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/EXSP001.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/error-codes/ESFH002.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/api-reference-guide/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/calculator/neuron-calculator.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/neuron2-intro-faq.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/contributing-faq.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/onnx-faq.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/index.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/models/index.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/models/inference-inf1-samples.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/models/training-trn1-samples.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/models/inference-inf2-trn1-samples.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/index.html 2025-11-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/glossary.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/oss/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/mxnet-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/torch-neuron-tab-training.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/tensorflow-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/docs-quicklinks.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/user-guide-quickstart.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/github-samples.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/torch-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/tab-inference-tensorflow-neuron.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/inference-quickstart.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/quick-start/training-quickstart.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/news-and-blogs/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trn1-arch.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/inferentia2.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trainium3.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/inf1-arch.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trainium2.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/inferentia.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trn2-arch.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trn3-arch.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/neuron-core-v4.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/neuron-core-v1.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/inf2-arch.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trainium.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/neuron-core-v2.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/neuron-core-v3.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/rounding-modes.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/data-types.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/neuron-caching.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/custom-c++-operators.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/neuroncore-pipeline.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/neuroncore-batching.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-features/logical-neuroncore-config.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/inf2/inf2-performance.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/trn1/trn1-training-performance.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/trn1/trn1-inference-performance.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/inf1/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-transition-pytorch-trainium.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-multiframework-dlamis-inf1.html 2025-10-17
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-maintenance-nxdt-nxd-core-training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-2-7-2-8.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-tensorflow-2-8-9.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-1-1-3.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-python38-no-longer-support.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/github-changes.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/sm-training-trn1-introduce.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-nemo-megatron.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-nemo.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-beta-pytorch-neuroncore-placement-apis.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pt2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-nxdi-changes.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eol-megatron-lm.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-device-version.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-torch-neuronx-nki-jit.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-xla-bf16.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-llama3-2-checkpoint.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-jax-neuronx-nki-call.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-ubuntu-20-base.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eos-tensorflow-tutorial-inf.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-end-of-support-pytorch-2-6.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-maintenance-nxdi-nxd-core-inference.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-neurondevice.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eol-nemo-arg.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eos-tnx.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-python38.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-113.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-nki-library-kernel-migration.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eol-ubuntu-18.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler-v230.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-1-9.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-al2.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-u20-dlamis.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/sm-training-dlc-2.9.1.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-inf1-virtual-environments.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-maintenance-mxnet.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-nki-library-namespace-changes.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-profiling-api.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/neuron2-intro.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-nxd-examples.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-dlami-ubuntu-22-04.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/gpg-expiration.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-tensorflow-inf2.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-torch-neuron-versions.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-tensorboard-tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-component-change.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-nxd-examples.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-tensorflow1-x.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-package-change.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-mllama-checkpoint.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-end-of-support-nxdt-nxd-core.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eos-pt2-6.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-7-2-8-v229.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-driver-support-inf1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/dlami-neuron-2.10.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-end-of-support-neuronxcc-nki.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-tensorflow-inf2.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pt-versions.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/dlami-pytorch-introduce.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-torch-neuron.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/dlami-neuron-2.12.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-tf-versions.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-pytorch-2-1.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-profiler-2.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/neuron250-packages-changes.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-neuron-det.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-vllm-v0.html 2026-02-26
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-tensorboard-plugin.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neurondevice.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-end-of-support-vllm-v0.html 2026-02-26
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-deprecation-nxd-path-trace-api.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-nki-jit-torch.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-maintenance-tf.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-python-3-9-eol.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-deprecation-transformer-flag.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-det.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-nki-library-namespace-changes-2-28.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-tensorflow2-10.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-moving-samples.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eos-opt.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-megatronlm-2-13.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neurondevice-version.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-block-dimension-nki.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-1.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announcement-end-of-support-parallel-model-trace.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-probuf.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-deprecation-containers-rtd.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-correction-neuron-driver-support-inf1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/end-of-support-pt2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/neuron230-packages-changes.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-al2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-nxdt-nxd-core-training.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-maintenance-tnx.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-support-tensorflow1-x.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/neuron-rtd-eol.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/release-neuron2.4.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-no-longer-support-u20-dlc-dlami.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-jax-neuronx-nki-call.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eol-python-3-7.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-bf16-vars.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-pytorch-2-7-2-8.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-nki-namespace-migration.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-dlami.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-intent-eos-pt-version.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/eol-pt-15.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announce-eol-pt-before-1-8.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/eol-tf-21-24.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announce-eol-pt-1-5.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announce-eol-mx-before-1-5.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announce-eol-tf-before-2-5.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announcements.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/eol-ncgs-env_2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron1.x/announce-eol-tf-before-2-7.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-7.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/torch-neuronx-graph-partitioner-app-note.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-6.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/torch-neuronx-dataparallel-app-note.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-8.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-9.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/introducing-pytorch-2-x.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuronx/migration-from-xla-downcast-bf16.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/mxnet-neuron/flex-eg.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/transformers-neuronx/generative-llm-inference-with-neuron.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuron-cc/mixed-precision.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuronx-distributed/introducing-nxd-inference.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuronx-distributed/introducing-nxdt-training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuron/index.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuron/bucketing-app-note.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuron/torch-neuron-dataparallel-app-note.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/torch-neuron/rcnn-app-note.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuron1x/introducing-libnrt.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/perf/neuron-cc/performance-tuning.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/perf/neuron-cc/parallel-ncgs.html 2025-10-20
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/training/neuron-training.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/inference/neuron-faq.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/faq/inference/trouble-shooting-faq.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-setup.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/inference-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/guide-torch-neuron-vs-torch-neuronx-inference.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/pytorch-native-overview.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/training-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/jax/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/jax/setup/jax-setup.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/jax/setup/jax-neuronx-known-issues.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/jax/api-reference-guide/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/jax/api-reference-guide/neuron-envvars.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-disable-dynamic-batching.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-dynamic-batching.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/pytorch-neuron-supported-operators.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-default.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/misc-inference-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/additional-examples-training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/additional-examples-inference-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-dim-neq-zero.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/torch-neuronx-dataparallel-example-specify-ncs.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup-trn1-multi-node-execution.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/misc-training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/about/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-al2.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-u20.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-u22.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-al2023.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-u24.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u24.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/note-setup-general.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u22.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install-prev-u20.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-al2-dlami.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-neuronx-install-cxx11.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install-prev-al2023.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update-u20-dlami.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/torch-neuronx-profiling-api.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/torch-neuronx-profiling-dev-guide.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/analyze_for_training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/tutorials-training-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/finetune_hftrainer.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/zero1_gpt2.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/bert.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/inference/tutorials-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/inference/tutorial-torchserve-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-debug.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/core-placement.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/autobucketing-dev-guide.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/trace-vs-xla-lazytensor.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/torch-neuron-envvars.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-async-lazy-load.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-replace-weights.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-core-placement.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/inference-api-guide-torch-neuronx.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-data-parallel.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-analyze.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.8.0-pytorch-install.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.7.0-pytorch-install.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/prev-releases/neuronx-2.9.0-pytorch-install.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/.git/logs/refs/remotes/origin/VRF004.html 2026-01-27
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/how-to/how-to-ultraserver.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/files/index-dra.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/tutorial-oci-hook.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-device-plugin.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-monitor.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/build-run-neuron-container.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-multiple-scheduler.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-problem-detector-and-recovery.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/tutorial-docker-env-setup.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-default-scheduler.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-helm-chart.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-prerequisite.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler-flow.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/index.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/get-started/quickstart-configure-deploy-dlc.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/get-started/quickstart-pytorch-inference-dlc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/training/Dockerfile-trainium-dlc.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/training/mlp.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/config-properties.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/torchserve-neuron.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/dockerd-libmode-entrypoint.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/Dockerfile-tf-serving.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/Dockerfile-libmode.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/Dockerfile-inference-dlc.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/v1/inference/Dockerfile-torch-neuron.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/v1/inference/Dockerfile-app-rt-same.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/v1/inference/Dockerfile-app-rt-diff.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/v1/inference/dockerd-entrypoint-app-rt-same.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/v1/inference/Dockerfile-neuron-rtd.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/training/index.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/training/tutorial-training.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/training/k8s_mlp_train_demo.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/inference/index.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/inference/k8s_rn50_demo.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/inference/tutorial-infer.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/about/index.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/about/core-dump.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/about/collectives.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/compute-comm-overlap.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/index.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/work-with-neff-files.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/direct-hbm-tensor-alloc.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/core-dump-deep-dive.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/intranode-collective-comm.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/device-memory.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/runtime-performance-tips.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/internode-collective-comm.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_async_sendrecv.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_status.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt-async-api-best-practices.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_async.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/ndebug_stream.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_sys_trace.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/debug-stream-api.html 2026-02-05
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt-async-api-examples.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/ndl.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt-async-api-overview.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nec.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/neuron_driver_shared.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_experimental.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/neuron_ds.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_profile.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/neuron_driver_shared_tensor_batch_op.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/api/nrt_version.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.28.1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.28.0.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/rn.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/mxnet-neuron.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/torch-neuron.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorboard-neuron.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/libneuronxla.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/index.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/runtime.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/nxd-inference.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/nki-lib.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/dlamis.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/dev-tools.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/nki.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/containers.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/nxd-training.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/compiler.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/jax.html 2026-04-08
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/pytorch.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/components/nxd-core.html 2026-04-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/nemo/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/nemo/neuronx-nemo.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/customcxxps/gpsimd-tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/customcxxps/gpsimd-customop-lib.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron1/prev/content.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron1/prev/rn.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron1/neuronrelease/previous-content.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-neuron/tensorflow-neuron.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-neuron/tensorflow-neuron-v2.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-modelserver-neuron/tensorflow-modelserver-neuron.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-modelserver-neuron/tensorflow-modelserver-neuronx.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-modelserver-neuron/tensorflow-modelserver-neuron-v2.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/tensorflow/tensorflow-neuronx/tensorflow-neuronx.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc-ops/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-pytorch.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-xla.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-tensorflow.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/archive/neuron-cc/neuron-cc-ops/neuron-cc-ops-mxnet.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/index.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/nx-jax.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/runtime.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/dlami.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/nxd-inference.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/nx-pytorch.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/tools.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/docs-and-samples.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/containers.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/nxd-training.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/compiler.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.25.0/nxd-core.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/nx-jax.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/runtime.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/dlami.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/nxd-inference.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/nx-pytorch.html 2025-12-19
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/nki.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/containers.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.26.0/nxd-core.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/index.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/runtime.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/dlami.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/nxd-inference.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/nki-lib.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/nx-pytorch.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/tools.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/nki.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/containers.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/2.27.0/compiler.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/plugins/npd-ecs-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/dlc-then-ecs-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/aws-batch-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/sagemaker-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/parallelcluster-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/ec2-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/setup/ecs-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/setup/eks-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/dlc-then-ecs-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/byoc-hosting-devflow-inf2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/neo-then-hosting-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/dlc-then-k8s-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/dlc-then-ec2-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/aws-batch-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/dlc-then-eks-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/ec2-then-ec2-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/sagemaker-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/dev-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/byoc-hosting-devflow.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/parallelcluster-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/env-setup-text.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/ec2-then-ec2-devflow-inf2.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/ec2-flows.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/inference/container-sm-hosting-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/sm-devflow/sm-training-devflow.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/batch/batch-training.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/parallelcluster/parallelcluster-training.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/devflows/training/ec2/ec2-training.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nemo-megatron/index.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/index.html 2025-10-28
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/index.html 2026-02-26
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/overview-index.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/neuron-inference-overview.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/api-reference-guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/index.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/overview.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/misc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer-guide.html 2025-11-11
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api-reference-guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tp_developer_guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/context_parallelism_overview.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/pp_developer_guide.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/model_optimizer_wrapper_developer_guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction_developer_guide.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api-reference-guide-training.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/neuronx_distributed_inference_developer_guide.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide-inference.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api-reference-guide-inference.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/model_builder_v2_api_reference.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide-training.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/lora_finetune_developer_guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/app_notes.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/pipeline_parallelism_overview.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/standard_mixed_precision.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/ptl_developer_guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index-training.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/activation_memory_reduction.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/save_load_developer_guide.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index-inference.html 2026-02-03
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/neuronx-distributed-misc.html 2026-02-25
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/setup/index.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/inference.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/index.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_tutorials.html 2025-10-09
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_pp.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html 2026-04-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/inference_tutorials.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/finetune_llama3_8b_ptl_lora.html 2025-12-01
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/general/config_overview.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/general/features.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/general/installation_guide.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/general/known_issues.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-tp-appnote.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-cp-appnote.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-amr-appnote.html 2025-10-07
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/app_notes/nxd-training-pp-appnote.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/index.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_70B_pretraining.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_DPO_ORPO.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT_LORA.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/checkpoint_conversion.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/index.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/migration_nnm_nxdt.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/cpu_mode_developer_guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/optimizer_lr_scheduler_flow.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/new_model_guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/new_dataloader_guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/developer_guides/migration_nemo_nxdt.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/misc/index.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/misc/nxdi-troubleshooting.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/app-notes/parallelism.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/app-notes/index.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/index.html 2026-02-26 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/quickstart-vllm-offline-serving.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/quickstart-vllm-online-serving.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/index.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/disaggregated-inference-tutorial.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn3-gpt-oss-120b-tutorial.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/sd-inference-tutorial.html 2026-02-03 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.1-405b-tutorial.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-fp8.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.1-405b-speculative-tutorial.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/disaggregated-inference-tutorial-1p1d.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/index.html 2026-02-26 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/nxd-examples-migration-guide.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/model-reference.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/weights-sharding-guide.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide-v1.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/performance-cli-params.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/moe-arch-deep-dive.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html 2026-02-26 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/llm-inference-benchmarking-guide.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/disaggregated-inference.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/migrate-from-tnx-to-nxdi.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/how-to-use-fpem.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/accuracy-eval-with-datasets.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/writing-tests.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/index.html 2025-11-12 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/llama3/llama_33_70b.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/models/qwen3/qwen3_moe_235b.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/al2-python.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/launch-trn1-dlami.html 2025-10-09 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/legacy-inf1/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/legacy-inf1/pytorch.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/notebook/running-jupyter-notebook-as-script.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/update-dlc.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/manual.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/update-manual.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/update-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/pytorch/dlc.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax/dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax/manual.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/jax/dlc.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf2/note-setup-libnrt-warning.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf2/launch-inf2-dlami.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf2/dlami-enable-neuron-pytorch.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/trn1/dlami-notes.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/neuron-pip-install.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/note-setup-cntr.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/launch-inf1-dlami-aws-cli.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/launch-inf1-dlami.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/launch-inf1-ami.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/note-setup-general.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/neuron-pip-setup.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/compile_mode.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/develop_mode.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/tensorboard-plugin-neuron-pip-install.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/dlami-enable-neuron-mxnet.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/note-setup-libnrt-warning.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/dlami-enable-neuron-pytorch.html 2025-10-09 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/setup/install-templates/inf1/deploy_mode.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-dge.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/index.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-dynamic-loops.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-compiler.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/mxfp-matmul.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/use-neuron-profile.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki_block_dimension_migration_guide.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-dma-bandwidth-guide.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-beta2-migration-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-aps.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-hbm-crc-hashing.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki-0-3-0-update-guide.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/deep-dives/nki_perf_guide.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/how-to-scheduling-apis.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/index.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/nki_simulator.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/framework_custom_op.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.isa.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.collectives.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.api.shared.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.language.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.simulate.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/nki.language.tile_size.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/index.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/quickstart-implement-run-kernel.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/nki-language-guide.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/setup-env.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/index.html 2026-04-08 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/tiling-overview.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/indexing-overview.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/data-representation-overview.html 
2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/lnc.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/nki-dma-overview.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/get-started/about/memory-hierarchy-overview.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.set_rng_seed.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.memset.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_n_gather.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gelu_apprx_tanh.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gelu_apprx_sigmoid.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.bn_stats.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.erf_dx.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.ceil.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.is_hbm.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float32.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dge_mode.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.rms_norm.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.arctan.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dma_engine.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.greater.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.mish.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.store.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.maximum.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.activation.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sequential_range.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dropout.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.register_store.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.shared_identity_matrix.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_scalar_cumulative.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.trunc.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.rng.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dma_transpose.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.tan.html 2026-04-01 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.affine_select.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.matmul_perf_mode.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.VirtualRegister.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.exp.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dma_compute.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.less_equal.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_matmul_mx.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gelu.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.register_move.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gelu_apprx_sigmoid_dx.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.broadcast_to.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.sendrecv.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.max.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.logical_not.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.reduce_scatter.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.collective_permute.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.softplus.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.static_range.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.all_gather.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.subtract.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_transpose.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float8_e5m2.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.load.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.shared_constant.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.bitwise_or.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.quantize_mx.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float4_e2m1fn_x4.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.collective_permute_implicit_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.softmax.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.program_id.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.int8.html 2026-02-24 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.greater_equal.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_version.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_scalar.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.invert.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.uint32.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.core_barrier.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sign.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.shared_hbm.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.negative.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.affine_range.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nonzero_with_count.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.tfloat32.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.zeros.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.oob_mode.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.collective_permute_implicit.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.square.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.is_on_chip.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.ndarray.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.matmul.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.ones.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float8_e4m3.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.num_programs.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.rank_id.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.exponential.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.where.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.is_psum.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float8_e4m3fn.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.power.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_stream_shuffle.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.erf.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.register_alloc.html 2026-02-24 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.rand_set_state.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.rand.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.reciprocal.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.load_transpose2d.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.int32.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sbuf.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sum.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.log.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.get_nc_version.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.equal.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.select_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.dma_copy.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.register_load.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.int16.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.engine.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.less.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float8_e4m3fn_x4.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.scalar_tensor_tensor.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.bfloat16.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_copy_predicated.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.private_hbm.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.not_equal.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_copy.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.logical_and.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.logical_or.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.multiply.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.uint8.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_find_index8.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.max8.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.floor.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.right_shift.html 2026-04-01 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float8_e5m2_x4.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.prod.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.all_to_all.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.rsqrt.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.bool_.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.mean.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.expand_dims.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.min.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.left_shift.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.add.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.hbm.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.rand_get_state.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.jit.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.relu.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.range_select.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.ReplicaGroup.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_match_replace8.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_partition_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.all_to_all_v.html 2026-04-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.logical_xor.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.uint16.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.nc_matmul.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_tensor_scan.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.program_ndim.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.local_gather.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.tanh.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.bn_aggr.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sigmoid.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.bitwise_and.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_tensor.html 2026-02-24 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.is_sbuf.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.transpose.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.simulate.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.ds.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.reduce_cmd.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.all_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.no_reorder.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.tensor_scalar_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.cos.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.copy.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.activation_reduce.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.device_print.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.full.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.silu_dx.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.collectives.collective_permute_implicit_current_processing_rank_id.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.psum.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.tile_size.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.empty_like.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sin.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.silu.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.iota.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.sequence_bounds.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.minimum.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.var.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.abs.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.reciprocal.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gather_flattened.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.bitwise_xor.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.random_seed.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.isa.rand2.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.zeros_like.html 2026-04-01 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.sqrt.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.gelu_dx.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.dropout.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.dynamic_range.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.all.html 2026-04-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/api/generated/nki.language.float16.html 2026-02-24 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/architecture/index.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/architecture/trainium2_arch.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/architecture/trainium3_arch.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/architecture/trainium_inferentia2_arch.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/index.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/matrix_multiplication.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/spmd_multiple_nc_tensor_addition.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/fused_mamba.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/kernel-optimization.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/transpose2d.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/average_pool2d.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/guides/tutorials/spmd_tensor_addition.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/kernel-utils/index.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/kernel-utils/tiled-range.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/kernel-utils/tensor-view.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/kernel-utils/allocator.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/kernel-utils/stream-shuffle-broadcast.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/specs/index.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/specs/design-rmsnorm-quant.html 2026-02-17 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/about/index.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/transformer-tkg.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/index.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/cross-entropy.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/conv1d.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/find-nonzero-indices.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/output-projection-cte.html 2026-02-25 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/dynamic-elementwise-add.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/blockwise-mm-backward.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/mlp.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/fgcc.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/attention-cte.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/depthwise-conv1d.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/rmsnorm-quant.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/qkv.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/moe-tkg.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/output-projection-tkg.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/attention-block-tkg.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/fg-allgather.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/topk-reduce.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/sb2sb-allgather.html 2026-04-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/cumsum.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/moe-cte.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/attention-tkg.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/router-topk.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/library/api/rope.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/api-reference-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/mxnet-neuron-setup.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/misc-mxnet-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/neo-then-hosting-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/ec2-then-ec2-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/api-compilation-python-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/developer-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/troubleshooting-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/inference-mxnet-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/index.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_terminology.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_benchmark_guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_troubleshooting.html 2025-12-01 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_overview.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_examples.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_model_index_guide.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_evaluate_guide.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_faq.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/rn.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_compile_guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_framework_notes.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/neuronperf/neuronperf_install.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/api-reference-guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/index.html 2025-10-28 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/transformers-neuronx-tutorials.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/transformers-neuronx-developer-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/transformers-neuronx-api-reference.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/transformers-neuronx-misc.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/developer-guide.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html 2025-10-28 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/torch-neuron-dataparallel-example-specify-ncs.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/torch-neuron-dataparallel-example-dynamic-batching.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/api-reference-guide-torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/torch-neuron-dataparallel-example-disable-dynamic-batching.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/inference-torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/api-torch-neuron-dataparallel-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/developer-guide-torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/torch-neuron-dataparallel-example-dim-neq-zero.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/additional-examples-inference-torch-neuron.html 2026-04-07 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/api-compilation-python-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/torch-neuron-dataparallel-example-default.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/troubleshooting-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/api-core-placement.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/misc-inference-torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/helper-tools/index.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/helper-tools/tutorial-neuron-check-model.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/helper-tools/tutorial-neuron-gatherinfo.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/setup-legacy-inf1-tensorflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron-inference.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx-inference.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-setup.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorboard/getting-started-tensorboard-neuron-plugin.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/training-gpt-neox.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/gpt3_neuronx_nemo_megatron_pretraining.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/training_llama2_tp_pp_ptl.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/megatron_gpt_pretraining.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/training-gpt-neox-20b.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/finetuning_llama2_7b_ptl.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/finetune_t5.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/multinode-training-model-profiling.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/training_codegen25_7b.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tutorials/ssd300_demo/ssd300_demo.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/api-reference-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/api-tracing-python-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tensorflow2-accelerated-ops.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/additional-examples.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/dlc-then-ecs-devflow.html 2026-04-07 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/neo-then-hosting-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/dlc-then-ec2-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/api-auto-replication-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/dlc-then-eks-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/ec2-then-ec2-devflow.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/misc-tensorflow-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tf2_faq.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/api-tfn-analyze-model-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/api-compilation-python-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/api-reference-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/tfnx-analyze-model-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/tf-neuronx-auto-replication-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/tfneuronx-python-tracing-api.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/misc-tensorflow-neuronx.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-al2.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-install-prev-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-neuronx-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u20-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-al2.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/tensorflow-update-al2-dlami.html 2026-04-07 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/tutorials/tutorial-tensorflowx-serving-NeuronRT-Visible-Cores.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/tutorials/tutorials-tensorflow-neuronx.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.9.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuronx/setup/prev-releases/neuronx-2.8.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-install-prev-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-update.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-update-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/tensorflow-update-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-nlp.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/tensorflow-tutorial-setup.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/tutorials-tensorflow-utilizing-neuron-capabilities.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.2-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.16.3-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.17.1-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.2-tensorflow-install.html 2026-04-07 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.1-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.18.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.19.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.15.0-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/tensorflow/tensorflow-neuron/setup/prev-releases/neuron-1.14.2-tensorflow-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-prev.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-prev-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-prev-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-prev-al2.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-cxx11.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update-al2-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-install-prev-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/pytorch-update-u20-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/guides/torch-lstm-support.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorials-torch-neuron-nlp.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/transformers-marianmt.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorials-utilizing-neuron-capabilities.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorials-torch-neuron-computervision.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/neuroncore_pipeline_pytorch.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorial-libtorch.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/pytorch-tutorial-setup.html 2026-04-07 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorial-torchserve.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/tutorials/tutorials-inference-torch-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/guides/core-placement/torch-core-placement.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.19.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.17.2-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-2.4.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.15.2-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.15.1-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.15.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.18.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-2.3.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.16.1-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-2.5.0-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.16.2-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.16.3-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/torch-neuron/setup/prev-releases/neuron-1.14.2-pytorch-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/transformers-neuronx/setup/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-update-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-al2-base-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-al2.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-ubuntu22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-install-prev-al2.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-install-prev-u20.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-install-prev-u22.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-ubuntu20.html 
2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-neuron-ubuntu20-base-dlami.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-install-prev-al2023.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/mxnet-update.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/tutorials-mxnet-utilizing-neuron-capabilities.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/tutorials-mxnet-computervision.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/tutorial-model-serving.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/tutorials-mxnet-nlp.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/tutorials-mxnet-neuron.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/tutorials/mxnet-tutorial-setup.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.17.2-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.15.2-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.14.2-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.19.0-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.16.3-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.18.0-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.15.0-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/archive/mxnet-neuron/setup/prev-releases/neuron-1.15.1-mxnet-install.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/index.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/nccom-test.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-ls.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tensorboard/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tensorboard/getting-started-tensorboard-neuronx-plugin.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tutorials/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tutorials/torch-neuronx-profiling-with-tb.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tutorials/tutorial-neuron-monitor-mnist.html 2026-02-03 
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tutorials/tutorial-tensorboard-scalars-mnist.html 2025-10-09 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/tutorials/performance-profiling-vllm.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/profiler/neuron-profile-user-guide.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/profiler/neuron-profiler-2-0-beta-user-guide.html 2025-12-01 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/index.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-system-profiles.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-hierarchy-view.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-database-viewer.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/how-to-link-view-source-code.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/migration-faq.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-device-profiles.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/get-started.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/how-to-profile-workload.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-summary-page.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-memory-viewer.html 2026-04-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/view-perfetto.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-tensor-viewer.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-explorer/overview-ai-recommendations.html 2026-02-25 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/api-reference-guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/custom-ops-ref-guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/custom-c++-operators-devguide.html 2026-02-03 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/programming-guide.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-perf-opt.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/tutorials.html 2025-10-07 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-training.html 2025-10-07 ================================================ FILE: tools/index.rst ================================================ .. _neuron-tools: .. meta:: :description: Developer tools for profiling, monitoring, and analyzing machine learning workloads on AWS Neuron devices. 
   :keywords: AWS Neuron, developer tools, profiler, monitoring, analysis, TensorBoard, visualization, debugging, optimization
   :date-modified: 12/02/2025

Developer Tools
================

AWS Neuron provides a comprehensive suite of developer tools for optimizing, monitoring, and debugging machine learning workloads on AWS Inferentia and Trainium accelerators. These tools enable developers to gain deep insights into model performance, system utilization, and hardware behavior to maximize the efficiency of ML applications running on Neuron-enabled instances.

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: Neuron Explorer
      :link: /tools/neuron-explorer/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Neuron Explorer is a suite of tools designed to support ML engineers throughout their development journey on AWS Trainium, from model development through debugging, profiling, analysis, and optimization.

   .. grid-item-card:: Neuron Profiler 2.0
      :link: /tools/profiler/neuron-profiler-2-0-beta-user-guide
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Neuron Profiler 2.0 offers a user-friendly experience for capturing and analyzing application performance through both high-level system profiles and detailed device-level profiles.

   .. grid-item-card:: Neuron Profiler
      :link: /tools/profiler/neuron-profile-user-guide
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      The Neuron Profiler is a tool to profile and analyze the performance of an ML model compiled with the Neuron compiler and run on NeuronDevices.

   .. grid-item-card:: System Tools
      :link: /tools/neuron-sys-tools/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Command-line utilities for monitoring, debugging, and managing AWS Neuron devices, including neuron-monitor, neuron-top, neuron-ls, and more.

   .. grid-item-card:: Third Party Tools
      :link: /tools/third-party-solutions
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Third-party tools and integrations that support the AWS Neuron development experience, including monitoring, visualization, and optimization solutions.

..
   .. grid-item-card:: AP Visualizer
      :link: ap-visualizer/ap-visualizer.html
      :link-type: url
      :class-header: sd-bg-primary sd-text-white

      Visualize access patterns of tensors on Neuron devices.

.. grid:: 1 1 2 2
   :gutter: 3

   .. grid-item-card:: Tutorials
      :link: /tools/tutorials/index
      :link-type: doc
      :class-header: sd-bg-secondary sd-text-white

      Tutorials on how to use the Neuron Tools.

   .. grid-item-card:: Release Notes
      :link: /release-notes/components/dev-tools
      :link-type: doc
      :class-header: sd-bg-secondary sd-text-white

      Latest updates, new features, and improvements to Neuron Tools and Neuron Explorer.

.. toctree::
   :maxdepth: 1
   :hidden:

   Neuron Profiler 2.0
   Neuron Profiler
   System Tools
   Third-party Tools
   Tutorials
   Release Notes

================================================
FILE: tools/neuron-explorer/get-started.rst
================================================

.. meta::
   :description: Set up and get started with Neuron Explorer, the Neuron SDK profiler
   :date-modified: 12/02/2025

.. _new-neuron-profiler-setup:

Get Started with Neuron Explorer
========================================

In this guide, you'll learn how to set up and launch Neuron Explorer, including the web-based UI for interactive analysis. By the end of this guide, you'll be able to visualize and analyze performance data for your models directly in your browser.
Overview
---------

In this guide, you'll launch an AWS Trainium or Inferentia EC2 instance using the AWS Deep Learning AMI (DLAMI) for Neuron, install and verify Neuron Explorer, start both the API and UI servers, and set up secure SSH tunneling to view the Neuron Explorer interface in your local browser. Use this tool when you want to collect, inspect, and visualize Neuron profiling data from model training or inference jobs running on Neuron-compatible instances.

At a high level, you will:

1. Launch a Neuron DLAMI instance
2. Verify Neuron Explorer installation
3. Start the Neuron Explorer servers
4. Configure SSH tunneling
5. Access the Neuron Explorer UI locally

Prerequisites
--------------

* An AWS account with permissions to launch EC2 instances.
* Access to an AWS Trainium or Inferentia instance type (such as trn1.2xlarge or inf2.xlarge).
* AWS Neuron DLAMI with the latest Neuron SDK preinstalled.
* SSH key pair (``.pem`` file) to securely connect to your EC2 instance.
* Local machine with an SSH client and web browser installed.

Before you begin
-----------------

Complete these steps before starting the task in this document:

1. Make sure you have an active AWS account and a default VPC available in your region.
2. Create or locate your SSH key pair (``.pem`` file) that allows access to your EC2 instance.

Instructions
-------------

1. Launch a Neuron-compatible EC2 instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Launch an EC2 instance with either a Trainium or Inferentia instance type using the AWS Neuron DLAMI. You can do this from the AWS Management Console or CLI. For instructions on how to launch an instance with the Neuron DLAMI, refer to the Neuron DLAMI setup documentation.

**Expected outcome**

Your instance should start and appear in the EC2 dashboard as "Running."

2. Verify that Neuron Explorer is installed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once you've connected to your EC2 instance with SSH, verify that Neuron Explorer and the associated tools are installed:

.. code-block:: bash

   apt list --installed | grep neuronx-tools

**Expected outcome**

You should see ``neuronx-tools`` listed among the installed packages, confirming that Neuron Explorer is available on your instance.

3. Launch the API and UI SPA servers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Start the Neuron Explorer web servers using the following command:

.. code-block:: bash

   neuron-explorer view -v 2 --data-path ./parquet_files

This command starts:

* The UI SPA (Single Page Application) server (default port: 3001)
* The API server (default port: 3002)

**Expected outcome**

You'll see terminal logs confirming that both the UI and API servers are running.

4. Set up SSH tunneling
^^^^^^^^^^^^^^^^^^^^^^^^

By default, Neuron Explorer runs locally on the EC2 instance. To securely access it from your local computer, you must create SSH tunnels for ports 3001 and 3002. Run the following command from your local machine terminal (replace placeholders such as ``your-key`` and ``public_ip_address_of_your_instance``):

.. code-block:: bash

   ssh -i ~/your-key.pem -L 3001:localhost:3001 -L 3002:localhost:3002 ubuntu@[public_ip_address_of_your_instance] -fN

**Explanation:**

* ``-L 3001:localhost:3001`` forwards the UI server.
* ``-L 3002:localhost:3002`` forwards the API server.
* ``-fN`` keeps the tunnel open in the background.

**Expected outcome**

No error messages should appear, indicating that your SSH tunnels are active.

.. note:: Replace ``ubuntu`` with the appropriate username for your AMI (for example, ``ec2-user`` on Amazon Linux).
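One quick way to confirm that the tunnel is forwarding traffic before opening the browser is to request the UI server from your local machine. This check is a minimal sketch, assuming ``curl`` is installed locally:

.. code-block:: bash

   # Prints an HTTP status code (for example, 200) once the UI server is reachable
   curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3001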
5. Connect to the Neuron Explorer UI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once your tunnel is active, open your preferred web browser and navigate to:

.. code-block:: text

   http://localhost:3001

**Expected outcome**

The Neuron Explorer UI loads in your browser, displaying an interactive dashboard for exploring profiling data.

Confirm your work
------------------

You've successfully set up Neuron Explorer! To confirm everything is working:

1. The browser should display the Neuron Explorer interface.
2. The terminal running the profiler command should show log activity when you interact with the UI.
3. You can explore profiling sessions from your ``./parquet_files`` directory.

If all these checks pass, you are ready to begin analyzing performance data using Neuron Explorer.

Common issues
---------------

If you encounter an error or other issue while working through this task, here are some commonly encountered issues and how to address them:

* **Neuron Explorer UI doesn't load**: Check that your SSH tunnel is configured correctly. Make sure ports 3001 and 3002 are forwarded using the ``-L`` flags in your SSH command, and verify the EC2 instance is running.
* **No profiling data displayed**: Double-check that the directory passed to ``--data-path`` contains valid ``.parquet`` profiling files generated by a prior Neuron profiling run.
* **neuron-explorer command not found**: Ensure that the Neuron SDK is installed. Make sure that you launched your instance with the Neuron DLAMI or that you set up your instance following the Neuron setup instructions.
* **Connection refused on port 3001 or 3002**: Confirm that your EC2 security group allows outbound traffic and that the SSH tunnel was created from your local machine, not from inside the instance.

================================================
FILE: tools/neuron-explorer/how-to-link-view-source-code.rst
================================================

.. meta::
   :description: Learn how to use source code linking in Neuron Explorer to understand code performance and optimize your applications
   :date-modified: 11/21/2025

.. _neuron-explorer-source-code:

Source Code Viewer
====================

In this guide, you'll learn how to use Neuron Explorer's source code linking feature to visualize connections between your application code and device performance. Discover how to navigate between source code and device instructions, highlight performance-critical sections, view framework stack traces, and leverage interactive code decorations to optimize your AWS Neuron applications for maximum efficiency.

Overview
--------

Source code linking helps you understand how your code changes affect device performance and identify ways to optimize it. This feature creates interactive connections between source code files and other Neuron Explorer widgets. You can zoom to device instructions from selected code lines, navigate between instructions and source code, and highlight instructions for specific loop iterations.

You can use source code linking in both the VS Code extension and the standalone web application, giving you flexibility for different developer workflows.
The Framework Stack Trace feature appears in the Event Details pane when an instruction in the device profile is clicked. It maps device instructions back to framework-level code in JAX or PyTorch, helping you understand which part of your application code produced a particular device instruction.

.. image:: /tools/profiler/images/view-link-1.gif

Instructions
-------------

To add the "NKI Source Location" field to a profile, set this environment variable: ``NEURON_FRAMEWORK_DEBUG=1``

To enable tracking of stack trace information, set these environment variables before compiling your NEFF:

.. code-block:: bash

   export XLA_IR_DEBUG=1
   export XLA_HLO_DEBUG=1

Once you have the NEFF, capture the profile as usual. To view your source code while viewing the profile, use the ``--framework-source-root`` flag to pass the path to your framework source files. This is optional and is only needed if you want to view your code alongside the displayed profile.

.. code-block:: bash

   neuron-explorer view -n file.neff -s profile.ntff --framework-source-root /path/to/framework/source/files

Code Viewer Widget
-------------------

Highlighting Instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Select source code lines to highlight their corresponding instructions in the profiler view. You can select individual lines or multiple lines through block selection or multiple cursors.

.. image:: /tools/profiler/images/view-link-2.png

Navigating to Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~

(Ctrl/Cmd)+Click any instruction to jump to its location in source code. If there are multiple matches, you will be prompted to select which file to navigate to.

.. image:: /tools/profiler/images/view-link-3.png

Source Code Decorations
~~~~~~~~~~~~~~~~~~~~~~~~

Performance metrics appear as decorations directly in your source code, updating automatically with the instruction profiler's time range. Configure which metrics to display in the settings panel. Currently, only instruction count and PE element count are supported.

.. image:: /tools/profiler/images/view-link-4.png

Navigating to Instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Select lines in your source code and navigate to their corresponding instructions using Ctrl+Shift+G, the context menu, or the "Zoom into Instructions" command from the command palette. The Device Trace Viewer will then zoom to show all instructions associated with your selection.

.. image:: /tools/profiler/images/view-link-5.png

Dependency Annotations
~~~~~~~~~~~~~~~~~~~~~~~

When enabled, selecting an instruction will highlight its dependent source code lines. The selected instruction's line will be highlighted in one color, with its dependencies shown in a different color.

.. image:: /tools/profiler/images/view-link-6.png

================================================
FILE: tools/neuron-explorer/how-to-profile-workload.rst
================================================

.. meta::
   :description: Learn how to capture a profile, launch the Neuron Explorer UI, and use the Profile Manager to analyze your workload performance.
   :date-modified: 12/02/2025

Capture and View Profiles in Neuron Explorer
================================================

In this guide, you'll learn how to capture a profile, launch Neuron Explorer, use the Profile Manager, and view Neuron Explorer in your IDE.

Capturing Profiles
------------------
To get a better understanding of your workload's performance, you must collect the raw device traces and runtime metadata in the form of an NTFF (Neuron Trace File Format), which you can then correlate with the compiled NEFF (Neuron Executable File Format) to derive insights.

Set the following environment variables before compiling to capture more descriptive layer names and stack frame information.

.. code-block:: bash

   export XLA_IR_DEBUG=1
   export XLA_HLO_DEBUG=1

For NKI developers, set ``NEURON_FRAMEWORK_DEBUG`` in addition to the two above to enable kernel source code tracking:

.. code-block:: bash

   export NEURON_FRAMEWORK_DEBUG=1

If profiling was successful, you will see NEFF (``.neff``) and NTFF (``.ntff``) artifacts in the specified output directory, similar to the following:

.. code-block:: bash

   output
   └── i-0ade06f040a13f2bf_pid_210229
       ├── 395760075800974_instid_0_vnc_0.ntff
       └── neff_395760075800974.neff

Device profiles for the first execution of each NEFF per NeuronCore are captured, and NEFF/NTFF pairs with the same prefix (for PyTorch) or unique hash (for JAX or CLI) must be uploaded together. See the section on :ref:`uploading profiles <profile-manager-upload-profile>` for more details.

JAX Profiling API
~~~~~~~~~~~~~~~~~

When using the JAX context-managed profiling API, set two extra environment variables to signal the profile plugin to begin capturing device profile data when the profiling API is invoked.

.. code-block:: python

   os.environ["NEURON_RT_INSPECT_DEVICE_PROFILE"] = "1"
   os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"] = "./output"

Then, profile a block of code:

.. code-block:: python

   with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
       ...

Full code example:

.. code-block:: python

   from functools import partial
   import os
   from time import sleep

   import jax
   import jax.numpy as jnp
   from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
   from jax.experimental.shard_map import shard_map

   os.environ["NEURON_RT_INSPECT_DEVICE_PROFILE"] = "1"
   os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"] = "./output"

   jax.config.update("jax_default_prng_impl", "rbg")

   mesh = Mesh(jax.devices(), ('i',))

   def device_put(x, pspec):
       return jax.device_put(x, NamedSharding(mesh, pspec))

   lhs_spec = P('i', None)
   lhs = device_put(jax.random.normal(jax.random.key(0), (128, 128)), lhs_spec)
   rhs_spec = P('i', None)
   rhs = device_put(jax.random.normal(jax.random.key(1), (128, 16)), rhs_spec)

   @jax.jit
   @partial(shard_map, mesh=mesh, in_specs=(lhs_spec, rhs_spec), out_specs=rhs_spec)
   def matmul_allgather(lhs_block, rhs_block):
       rhs = jax.lax.all_gather(rhs_block, 'i', tiled=True)
       return lhs_block @ rhs

   with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
       out = matmul_allgather(lhs, rhs)
       for i in range(10):
           with jax.profiler.TraceAnnotation("my_label" + str(i)):
               out = matmul_allgather(lhs, rhs)
           sleep(0.001)

   expected = lhs @ rhs
   with jax.default_device(jax.devices('cpu')[0]):
       equal = jnp.allclose(jax.device_get(out), jax.device_get(expected), atol=1e-3, rtol=1e-3)
   print("Tensors are the same" if equal else "Tensors are different")

.. _neuron-explorer-capture-environment-variables:
.. _neuron-explorer-non-framework-user-experience:

Environment Variables
~~~~~~~~~~~~~~~~~~~~~

You can also control profiling with environment variables. This is useful when you can't easily change your application code, such as when running an executable which calls the Neuron Runtime, or in a containerized environment where the application code is built into the container image.
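As a minimal sketch of this approach, the two core variables described below can wrap any unmodified workload (``./my_app`` is a hypothetical stand-in for an executable that calls the Neuron Runtime):

.. code-block:: bash

   # Enable system profiling and choose where profile data is written
   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./output

   # Run the application unchanged; profiling is controlled entirely by the environment
   ./my_app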
.. _neuron-explorer-core-control-variables:

Core Control Variables
^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Description
     - Default behavior
   * - ``NEURON_RT_INSPECT_ENABLE``
     - Set to ``1`` to enable profiling
     - Enables system profiling and disables device profiling. To control which profile types are captured, see :ref:`Profile type selection <neuron-explorer-profile-type-selection>`
   * - ``NEURON_RT_INSPECT_OUTPUT_DIR``
     - Directory for profile data output
     - Default directory for captured profile data is ``./output``

.. _neuron-explorer-profile-type-selection:

Device or System Profile Type Selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note:: When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``, ``NEURON_RT_INSPECT_SYSTEM_PROFILE`` is enabled by default (set to ``1``) and ``NEURON_RT_INSPECT_DEVICE_PROFILE`` is disabled by default (set to ``0``).

When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``, two different profile types are available:

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Profile type
     - Description
     - Enable capture
     - Disable capture
   * - ``NEURON_RT_INSPECT_SYSTEM_PROFILE``
     - System-level
     - Captures runtime system events and operations
     - Set to ``1``
     - Set to ``0``
   * - ``NEURON_RT_INSPECT_DEVICE_PROFILE``
     - Device-level
     - Captures detailed NeuronCore hardware metrics
     - Set to ``1``
     - Set to ``0``

.. note:: These variables have no effect if ``NEURON_RT_INSPECT_ENABLE`` is not set to ``1``.

.. _neuron-explorer-advanced-config-vars:

Advanced configuration for System Profiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Profile type
     - Description
     - Default behavior
   * - ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC``
     - System-level
     - Maximum trace events per NeuronCore before the oldest events are overwritten
     - 1,000,000

.. note:: Increasing the event limit will consume more host memory.

Capture using nccom-test with Environment Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Profiling can be enabled using environment variables. For a simple way to generate a Neuron workload, use :ref:`nccom-test <nccom-test>`, a benchmarking tool that is already available with the Neuron AMI.

.. code-block:: shell

   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./output
   nccom-test allr allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512

.. note:: If you have problems with nccom-test, add the ``--debug`` flag. If using a trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.

To understand the profiling output, see :ref:`neuron-explorer inspect Output <neuron-explorer-inspect-output>`.

Capture with EKS
^^^^^^^^^^^^^^^^

Capturing a profile on EKS is most easily done by setting environment variables, as described in :ref:`Environment Variables <neuron-explorer-non-framework-user-experience>`. By using environment variables, users do not need to change application code in their container image or modify their run commands. Update the deployment YAML to include the ``NEURON_RT_INSPECT_ENABLE`` and ``NEURON_RT_INSPECT_OUTPUT_DIR`` environment variables. For distributed workloads, it's important that ``NEURON_RT_INSPECT_OUTPUT_DIR`` points to a directory on a shared volume which all workers have access to.
.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: trn1-mlp
   spec:
     restartPolicy: Never
     schedulerName: default-scheduler
     nodeSelector:
       beta.kubernetes.io/instance-type: trn1.32xlarge
     containers:
       - name: trn1-mlp
         env:
           - name: NEURON_RT_INSPECT_ENABLE
             value: "1"
           - name: NEURON_RT_INSPECT_OUTPUT_DIR
             value: "/shared/output"
         command: ['torchrun']
         args:
           - '--nnodes=1'
           - '--nproc_per_node=32'
           - 'train_torchrun.py'
         image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:mlp
         imagePullPolicy: IfNotPresent
         resources:
           limits:
             aws.amazon.com/neuron: 16

.. note:: EKS users running PyTorch and JAX applications are still free to change their application code and use the PyTorch or JAX Python profiling APIs if they want finer-grained control over profiling. However, using the environment variables conveniently allows profiling without modifying the container image or application code.

CLI
~~~

In certain cases, you may want to profile the application without requiring code modifications, such as when deploying a containerized application through EKS. Note that when capturing with the CLI, profiling will be enabled for the entire lifetime of the application. If more granular control is required for profiling specific sections of the model, it is recommended to use the PyTorch or JAX APIs.

To enable profiling without code changes, run your workload with the following environment variables set:

.. code-block:: bash

   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_DEVICE_PROFILE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./output
   python train.py

CLI reference for System Profiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to controlling profiling with environment variables, you can use the ``neuron-explorer inspect`` command line interface for profiling applications. This provides the same functionality as the environment variables, but helps you avoid typos and invalid arguments, and provides a useful ``--help`` command to explain available options.

.. code-block:: shell

   Usage:
     neuron-explorer [OPTIONS] inspect [inspect-OPTIONS] [userscript...]

   Application Options:
     -v, --version            Show version and exit

   Help Options:
     -h, --help               Show this help message

   [inspect command options]
     -o, --output-dir=        Output directory for the inspection results (default: .)
     -n, --num-trace-events=  Maximum number of trace events to capture when profiling. Once hitting this limit, old events are dropped

   [inspect command arguments]
     userscript:              Run command/script that launches a Neuron workload. E.g. 'python app.py' or './runscript.sh'

Example of using System Profiles CLI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can provide any script that generates a Neuron workload (for example, a PyTorch training script) to the System Profiles CLI. For a simple way to generate a Neuron workload, use ``nccom-test``, a benchmarking tool that is already available with the Neuron AMI as part of the ``aws-neuronx-tools`` package.

.. code-block:: shell

   ubuntu@ip-172-31-63-210:~$ neuron-explorer inspect -o inspect-output-nccom-test nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512
   INFO[0000] Running command "nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512" with profiling enabled

       size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
        524288          131072    fp32           24.15          21.71          21.03
   Avg bus bandwidth:    21.0339GB/s

.. note:: If you have problems with nccom-test, add the ``--debug`` flag. If using a trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.
.. _neuron-explorer-inspect-output:

``neuron-explorer inspect`` Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above command traces a Neuron workload execution and saves the output to the ``inspect-output-nccom-test`` directory. The output directory contains a single NEFF file and a device profile (NTFF) for each NeuronCore that executed that NEFF. You will also see ``ntrace.pb`` and ``trace_info.pb`` files storing the system profile data. The output will look similar to the following:

.. code-block:: shell

   ubuntu@ip-172-31-63-210:~$ tree inspect-output-nccom-test
   inspect-output-nccom-test
   ├── i-012590440bb9fd263_pid_98399
   │   ├── 14382885777943380728_instid_0_vnc_0.ntff
   │   ├── 14382885777943380728_instid_0_vnc_1.ntff
   │   ├── 14382885777943380728_instid_0_vnc_10.ntff
   │   ├── 14382885777943380728_instid_0_vnc_11.ntff
   ...
   │   ├── 14382885777943380728_instid_0_vnc_8.ntff
   │   ├── 14382885777943380728_instid_0_vnc_9.ntff
   │   ├── cpu_util.pb
   │   ├── host_mem.pb
   │   ├── neff_14382885777943380728.neff
   │   ├── ntrace.pb
   │   └── trace_info.pb

   2 directories, 74 files

To view a summary of the captured profile data, run the command:

.. code-block:: shell

   neuron-explorer view -d inspect-output-nccom-test --output-format summary-text

.. _neuron-explorer-filtering-system-profiles:

Capture-time Filtering
----------------------

**Capture-time filtering** reduces memory usage and trace file size by only collecting specific events, but filtered data cannot be recovered later. Configure filters before trace capture using environment variables or API functions.

You can use NeuronCore filters to only capture events for specific NeuronCores (for example, only events associated with NeuronCore 0, or all the NeuronCores on a specific NeuronDevice). You can use event type filters to only capture specific events (for example, model execute or collectives events). It is possible to combine both NeuronCore and event type filters.

NeuronCore
~~~~~~~~~~

If capture is enabled for a NeuronCore, then a ring buffer will be allocated in host memory for storing that core's events. Thus, filtering by NeuronCore decreases host memory usage during capture.

Default Behavior
^^^^^^^^^^^^^^^^

By default, all visible NeuronCores are enabled for capture.

Using Environment Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell

   # Filter to capture events only from NeuronCore 0
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0

   # Filter to capture events from NeuronCores 0, 2, and 4
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0,2,4

   # Filter to capture events from a range of NeuronCores (0 through 3)
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0-3

   # Reset to default behavior
   unset NEURON_RT_INSPECT_EVENT_FILTER_NC  # Back to capturing all visible cores

Using API Functions
^^^^^^^^^^^^^^^^^^^
.. code-block:: c

   #include

   // Allocate and configure trace options
   nrt_sys_trace_config_t *config;
   nrt_sys_trace_config_allocate(&config);
   nrt_sys_trace_config_set_defaults(config);

   // Enable capture only for specific NeuronCores
   // Disable all cores since by default they are all enabled
   int num_cores = 128;
   for (int i = 0; i < num_cores; i++) {
       // disable capture for NeuronCore i here
   }

Event Type
~~~~~~~~~~

Use ``nrt_sys_trace_get_event_types`` to list the event type names that can be used for filtering:

.. code-block:: c

   #include

   // Get all available event types
   const char **event_types = nullptr;
   size_t count = 0;
   NRT_STATUS status = nrt_sys_trace_get_event_types(&event_types, &count);
   if (status == NRT_SUCCESS) {
       printf("Available event types:\n");
       for (size_t i = 0; i < count; ++i) {
           printf("  %s\n", event_types[i]);
       }

       // Free the event types array
       for (size_t i = 0; i < count; ++i) {
           free((void*)event_types[i]);
       }
       free((void*)event_types);
   }

Using Environment Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``NEURON_RT_INSPECT_EVENT_FILTER_TYPE`` environment variable supports:

* **Default**: If not set, all event types are captured
* **Specific event types**: Use exact event names from ``nrt_sys_trace_get_event_types()``
* **Event categories**: Use ``hardware`` or ``software`` to filter by category
* **Exclusion**: Use a ``^`` prefix to exclude specific events from a category

.. code-block:: shell

   # Filter to capture only specific event types
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=model_load,nrt_execute,runtime_execute

   # Filter to capture all hardware events
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware

   # Filter to capture all software events
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software

   # Filter to capture all hardware events EXCEPT cc_exec
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,^cc_exec

   # Filter to capture all software events EXCEPT model_load
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software,^model_load

   # Mix categories and specific events
   export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,tensor_read,tensor_write

   # Reset to default behavior
   unset NEURON_RT_INSPECT_EVENT_FILTER_TYPE  # Back to capturing all event types

The ``hardware`` group contains events that are executed on the NeuronCore: ``nc_exec_running``, ``cc_running``, ``cc_exec_barrier``, ``numerical_err``, ``nrt_model_switch``, ``timestamp_sync_point``, and ``hw_notify``. The ``software`` group contains all other events.

Using API Functions
^^^^^^^^^^^^^^^^^^^

Use the ``nrt_sys_trace_config_set_capture_enabled_for_event_type`` API to filter by event type.
.. code-block:: c

   #include

   // Configure trace options
   nrt_sys_trace_config_t *config;
   nrt_sys_trace_config_allocate(&config);
   nrt_sys_trace_config_set_defaults(config);

   // By default, all event types are enabled

   // Disable specific event types (others remain enabled)
   nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "device_exec", false);

   // Or disable all first, then enable only specific ones
   const char **all_event_types = nullptr;
   size_t all_count = 0;
   nrt_sys_trace_get_event_types(&all_event_types, &all_count);

   // Disable all event types first
   for (size_t i = 0; i < all_count; ++i) {
       nrt_sys_trace_config_set_capture_enabled_for_event_type(config, all_event_types[i], false);
   }

   // Enable only specific event types
   nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "model_load", true);
   nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "nrt_execute", true);

   // Verify which event types are enabled
   const char **enabled_types = nullptr;
   size_t enabled_count = 0;
   nrt_sys_trace_config_get_enabled_event_types(config, &enabled_types, &enabled_count);
   printf("Enabled event types: %zu\n", enabled_count);
   for (size_t i = 0; i < enabled_count; ++i) {
       printf("  %s\n", enabled_types[i]);
   }

   // Clean up memory (caller is responsible)
   for (size_t i = 0; i < enabled_count; ++i) {
       free((void*)enabled_types[i]);
   }
   free((void*)enabled_types);
   for (size_t i = 0; i < all_count; ++i) {
       free((void*)all_event_types[i]);
   }
   free((void*)all_event_types);

   // Start tracing
   nrt_sys_trace_start(config);

   // Your application code here...

   // Cleanup
   nrt_sys_trace_stop();
   nrt_sys_trace_config_free(config);

Processing-time Filtering
--------------------------

**Processing-time filtering** preserves the complete trace and allows flexible analysis with different filters, but requires more memory and storage during capture. Apply filters when viewing or processing already captured profiles. This approach allows you to analyze the same trace data in different ways without recapturing. The filters can be used with any ``neuron-explorer`` output format, including ``--output-format json`` and ``--output-format perfetto``.

NeuronCore
~~~~~~~~~~

Use the ``--system-trace-filter-neuron-core`` option to process events only for specific NeuronCores. The IDs are local to the instance, not global IDs. If the ``--system-trace-filter-neuron-core`` argument is not set, then events from all NeuronCores will be included in the processed trace.

**Single NeuronCore**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-neuron-core "0"

**Multiple NeuronCores**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-neuron-core "0,1,2,3"

Event Type
~~~~~~~~~~

Use the ``--system-trace-filter-event-type`` option to process only specific trace event types. If the ``--system-trace-filter-event-type`` argument is not set, then all event types will be included in the processed trace.

**Single event type**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-event-type "nrt_execute"

**Multiple event types**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-event-type "nrt_execute,nrt_load"
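The processing-time filters can also be combined in a single invocation. As a sketch, the following keeps only ``nrt_execute`` events from NeuronCore 0 and writes Perfetto output (the flag values are illustrative):

.. code-block:: shell

   neuron-explorer view -d ./output \
       --system-trace-filter-neuron-core "0" \
       --system-trace-filter-event-type "nrt_execute" \
       --output-format perfetto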
Instance ID
~~~~~~~~~~~

Use the ``--system-trace-filter-instance-id`` option to process events only for specific EC2 instances. If the ``--system-trace-filter-instance-id`` argument is not set, then events from all instances will be included in the processed trace.

**Single instance**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-instance-id "i-abc123"

**Multiple instances**

.. code-block:: shell

   neuron-explorer view -d ./output --system-trace-filter-instance-id "i-abc123,i-def456,i-ghi789"

Processing only system or device profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can reduce processing times by skipping the processing of system or device profiles. Choose this when you are interested in only a specific profile, or when you want to start with a limited set of profiling data before exploring the full profile.

To skip processing of device profiles, use the ``--ignore-device-profile`` option. To skip processing of system profiles, use the ``--ignore-system-profile`` option. These options can be used with the ``--output-format`` values ``parquet`` (default), ``perfetto``, or ``json``. For example:

.. code-block:: shell

   neuron-explorer view -d ./output --ignore-device-profile --output-format perfetto

View Profiles
-------------

To view a profile in Neuron Explorer, follow these steps:

1. **Start the Neuron Explorer UI and API servers** using the ``neuron-explorer`` tool from ``aws-neuronx-tools``:

   .. code-block:: bash

      neuron-explorer view --data-path /absolute/path/to/db

   By default, the UI will be launched on port 3001 and the API server will be launched on port 3002.

2. **Set up port-forwarding** (if running on a remote EC2 instance) to enable local viewing:

   .. code-block:: bash

      ssh -i <your-key.pem> <user>@<instance-address> -L 3001:localhost:3001 -L 3002:localhost:3002

   .. note:: It is necessary to forward both 3001 (for the UI server) and 3002 (for the data server).

3. **Open the UI** by navigating to ``localhost:3001`` in your browser.

4. **Upload your profile** by clicking the **"Upload Profile"** button in the Profile Manager page. You can either:

   * Upload the NEFF (``.neff``) and NTFF (``.ntff``) files individually using the "Individual Files" upload mode, or
   * Upload the folder containing the NEFF and NTFF files using the "Directory Upload" mode.

Neuron Explorer Browser UI
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _neuron-explorer-profile-manager:

Profile Manager
^^^^^^^^^^^^^^^

Profile Manager is a page for uploading artifacts (NEFF, NTFF, and source code) and selecting profiles to access.

.. image:: /tools/profiler/images/profile-workload-3.png

.. _profile-manager-upload-profile:

Click on the "Upload Profile" button to open the Upload Profile modal.

**Device Profile Upload**

Select "Individual Files" upload mode to upload the NEFF, NTFF, and source code individually. Select "Directory Upload" to upload profile files from a directory.

.. note::

   * "Profile name" is a required field. You cannot upload a profile with an existing name unless the "Force Upload" option is checked at the bottom. Force Upload currently overwrites the existing profile with the same name.
   * For uploading source code, the UI only supports the upload of folders, individual files, or compressed files in the gzipped tar ``.tar.gz`` archive format.

.. image:: /tools/neuron-explorer/images/device-profile-upload-ui.png

.. _profile-manager-system-profile-upload:

**System Profile Upload**

Select "Directory Upload", then in the Profile Directory drag-and-drop area, select the directory containing the system profile files. The directory should contain instance sub-directories with the following: ``ntrace.pb``, ``trace_info.pb``, ``cpu_util.pb``, and ``host_mem.pb``. For an example, see the output in :ref:`neuron-explorer inspect <neuron-explorer-inspect-output>`.

.. note:: System Profile uploads only support "Directory Upload".
.. image:: /tools/neuron-explorer/images/system-profile-upload-ui.png

**Processing Status**

After uploading a profile, the processing task is shown in the "User Uploaded" table. Use the "Refresh" button in the top-right to fetch the latest processing status and verify completion.

**Listing profiles**

All uploaded profiles are listed in the Profile Manager page with details such as the processing status and upload time, along with various quick-access actions.

.. image:: /tools/profiler/images/profile-workload-5.png

* **Pencil button**: Rename a profile.
* **Star button**: Mark this profile as a favorite. Favorite profiles are shown in the user's favorites list.
* **Bulb button**: Navigate to the summary page of this profile. For more details, see :doc:`the overview of the Neuron Explorer Summary Page <overview-summary-page>`.

Clicking on the name of a profile takes you to its corresponding profile page.

Neuron Explorer for Visual Studio Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The UI is also available as a VSCode extension, enabling better native integration for features such as code linking. Install the Neuron Explorer extension from the Visual Studio Code Marketplace. Open the Extensions view in VSCode by pressing **Ctrl+Shift+X** (Windows/Linux) or **CMD+Shift+X** (MacOS), and search for ``AWS Neuron Explorer`` or ``amazonwebservices.neuron-explorer``. Select the extension published by **Amazon Web Services** in the sidebar, then click the blue **Install** button.

.. image:: /tools/profiler/images/profile-workload-1.png

Ensure the SSH tunnel is established by following the steps above, then point the extension at the API server: select the extension in the left activity bar, navigate to the "Endpoint" action on the bottom bar of your VSCode session, select "Custom endpoint", and enter ``localhost:3002``.

.. image:: /tools/profiler/images/profile-workload-2.png

From there, navigate to the **Profile Manager** page through the extension UI in the left activity bar.

JSON Output
~~~~~~~~~~~

The ``--output-format json`` option writes processed profile data to human-readable JSON that can be used for scripting and manual inspection.

.. code-block:: shell

   neuron-explorer view -d ./output --output-format json

This will generate a ``system_profile.json`` file containing the system profile data and a ``device_profile_model_.json`` file for each unique compiled model that was executed on a NeuronDevice. The ``system_profile.json`` file contains the following data types:

* ``trace_events``: Neuron Runtime API trace events and Framework/Application trace events containing timestamps, durations, names, and the EC2 instance ID to differentiate between events from different compute nodes in a distributed workload.

  .. code-block:: json

     {
       "Neuron_Runtime_API_Event": {
         "duration": 27094,
         "group": "nrt-nc-000",
         "id": 1,
         "instance_id": "i-0f207fb2a99bd2d08",
         "lnc_idx": "0",
         "name": "nrt_tensor_write",
         "parent_id": 0,
         "process_id": "1627711",
         "size": "4",
         "tensor_id": "4900392441224765051",
         "tensor_name": "_unknown_",
         "thread_id": 1627711,
         "timestamp": 1729888371056597613,
         "type": 11
       },
       "Framework_Event": {
         "duration": 3758079,
         "group": "framework-80375131",
         "instance_id": "i-0f207fb2a99bd2d08",
         "name": "PjitFunction(matmul_allgather)",
         "process_id": "701",
         "thread_id": 80375131,
         "timestamp": 1729888382798557372,
         "type": 99999
       }
     }

* ``mem_usage``: sampled host memory usage

  .. code-block:: json

     {
       "duration": 1,
       "instance_id": "i-0f207fb2a99bd2d08",
       "percent_usage": 9.728179797845964,
       "timestamp": 1729888369286687792,
       "usage": 51805806592
     }

* ``cpu_util``: sampled CPU utilization. Results are provided per core and per EC2 instance involved in a distributed workload.

  .. code-block:: json

     {
       "cpu_id": "47",
       "duration": 1,
       "instance_id": "i-0f207fb2a99bd2d08",
       "timestamp": 1729888371287337243,
       "util": 2.3255813
     }
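Because the output is plain JSON, it lends itself to quick scripting. The following is a minimal Python sketch that totals event durations by name; it assumes a top-level ``trace_events`` list whose entries look like the samples above, which may differ between neuron-explorer versions:

.. code-block:: python

   import json
   from collections import Counter

   # Hypothetical post-processing of system_profile.json; adjust the key
   # names to match the actual layout produced by your neuron-explorer version.
   with open("system_profile.json") as f:
       profile = json.load(f)

   totals = Counter()
   for event in profile.get("trace_events", []):
       # Each event sample above carries a "name" and a "duration" field
       totals[event.get("name", "unknown")] += event.get("duration", 0)

   # Print the ten event names with the largest accumulated duration
   for name, duration in totals.most_common(10):
       print(f"{name}: {duration}")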
View in Perfetto
~~~~~~~~~~~~~~~~

You can view your Neuron Explorer profiles in Perfetto. See :doc:`view-perfetto` for more information.

.. note:: New Neuron Explorer features released in 2.27 and onwards may not be supported in Perfetto. For the full user experience and feature set, use the Neuron Explorer UI or the VSCode integration.

Troubleshooting
---------------

Incomplete JAX Profiles
~~~~~~~~~~~~~~~~~~~~~~~

If your JAX profile has fewer events than expected or lacks the Runtime API trace, check whether ``jax.profiler.stop_trace`` is being called inside a ``with jax.profiler.trace`` context block. This can prematurely stop tracing. Use ``jax.profiler.stop_trace`` only when profiling was started with ``jax.profiler.start_trace``, not when using the context-managed ``with jax.profiler.trace`` API.

Also, when using ``jax.profiler`` within your script, ensure that the environment variable ``NEURON_RT_INSPECT_ENABLE`` is not set to ``1``. Additionally, ensure that ``NEURON_RT_INSPECT_OUTPUT_DIR`` is set to the correct output directory and that this is the output directory passed to ``with jax.profiler.trace``.

Dropped Events in System Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When processing a system profile, you may see a warning indicating that some trace events were dropped during capture.

.. code-block:: shell

   WARN[0000] Warning: 1001 trace events were dropped during capture (stored 530560 out of 531561 total events). Consider increasing buffer size, reducing trace duration, or filtering events.

This means that during capture the trace event buffers filled and the oldest events were overwritten. If you need to avoid dropping events for the full duration of your workload, consider the following adjustments:

* Increase the buffer size by setting ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC``, as sketched below (see :ref:`Profile Capture Environment Variables <neuron-explorer-capture-environment-variables>`). This will increase host memory usage.
* Apply capture-time filters (NeuronCores / event types); see :ref:`Filtering System Profiles <neuron-explorer-filtering-system-profiles>`.
* Shorten the profiled region: limit the code span under the profiling context / runtime.
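For example, to raise the per-NeuronCore buffer above the 1,000,000-event default before the next capture (the value shown is illustrative; larger buffers consume more host memory):

.. code-block:: shell

   export NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC=5000000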
================================================
FILE: tools/neuron-explorer/index.rst
================================================

.. meta::
   :description: Neuron Explorer documentation for performance profiling, debugging, and optimization of ML workloads on AWS Trainium and Inferentia.
   :date-modified: 12/02/2025

.. _neuron-explorer-home:

Neuron Explorer
=================

.. important::

   Neuron Explorer is the recommended profiling tool for AWS Neuron workloads. It provides end-to-end profiling support along with the latest features and an improved user experience.

   **Note:** Neuron will end support for :ref:`Neuron Profiler 2.0 ` and :ref:`Neuron Profiler ` in the Neuron 2.29 release. Users are encouraged to migrate to Neuron Explorer. Please see :doc:`migration-faq` and :ref:`neuron-explorer-faq` for more details.

Neuron Explorer is a suite of tools designed to support ML engineers throughout their development journey on AWS Trainium. Neuron Explorer helps developers maintain context, iterate efficiently, and focus on building and optimizing high-performance models. Developers can access Neuron Explorer from the CLI, the UI, or inside their IDE through the VSCode integration.

Profiling Viewers
--------------------

Neuron Explorer lets ML performance engineers trace execution from source code down to hardware operations, enabling detailed analysis of model behavior at every layer of the stack. The suite of tools supports both single-node and distributed applications, allowing developers to analyze workloads at scale.

Getting Started
---------------

.. grid:: 1 2 2 2
   :gutter: 3

   .. grid-item-card:: Get Started
      :link: get-started
      :link-type: doc
      :class-card: sd-border-1

      Set up Neuron Explorer, launch the web UI, and configure SSH tunneling for secure access to profiling data.

   .. grid-item-card:: Capture and View Profiles
      :link: how-to-profile-workload
      :link-type: doc
      :class-card: sd-border-1

      Learn how to capture and view profiles in the Neuron Explorer UI or directly in your IDE via VSCode Integration.

Visualization and Analysis
---------------------------

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: Device Trace Viewer
      :link: overview-device-profiles
      :link-type: doc
      :class-card: sd-border-1

      Explore hardware-level execution with timeline view, operator table, event details, annotations, dependency highlighting, search, and more analysis features.

   .. grid-item-card:: System Trace Viewer
      :link: overview-system-profiles
      :link-type: doc
      :class-card: sd-border-1

      Explore system-level execution with timeline view and more analysis features.

.. grid:: 1 2 2 2
   :gutter: 3

   .. grid-item-card:: Hierarchy Viewer
      :link: overview-hierarchy-view
      :link-type: doc
      :class-card: sd-border-1

      Visualize the entire execution from model layers down to hardware execution, supporting interactivity with device viewer and source code linking.

   .. grid-item-card:: Source Code Viewer
      :link: how-to-link-view-source-code
      :link-type: doc
      :class-card: sd-border-1

      Navigate between NKI and PyTorch source code and profile data with bidirectional linking and highlighting.

   .. grid-item-card:: Summary Viewer
      :link: overview-summary-page
      :link-type: doc
      :class-card: sd-border-1

      Get streamlined performance insights and optimization recommendations with high-level metrics and visualizations.

   .. grid-item-card:: Database Viewer
      :link: overview-database-viewer
      :link-type: doc
      :class-card: sd-border-1

      Develop your own analyses, examine profiling data stored in database tables, or run ad-hoc queries during performance analysis.

   .. grid-item-card:: Tensor Viewer
      :link: overview-tensor-viewer
      :link-type: doc
      :class-card: sd-border-1

      View tensor information including names, sizes, shapes, and memory usage details.

   .. grid-item-card:: Memory Viewer
      :link: overview-memory-viewer
      :link-type: doc
      :class-card: sd-border-1

      Analyze memory allocation, usage patterns, and potential inefficiencies across SBUF partitions.

   .. grid-item-card:: AI Recommendation Viewer
      :link: overview-ai-recommendations
      :link-type: doc
      :class-card: sd-border-1

      Get AI-powered bottleneck analysis and optimization recommendations for NKI profiles.

Tutorials
----------

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: Profile a NKI Kernel
      :link: /nki/guides/use-neuron-profile
      :link-type: doc
      :class-card: sd-border-1

      Learn how to profile a NKI kernel with Neuron Explorer.

.. grid:: 1 2 2 2
   :gutter: 3
   .. grid-item-card:: vLLM Performance
      :link: /tools/tutorials/performance-profiling-vllm
      :link-type: doc
      :class-card: sd-border-1

      Capture and analyze system-level and device-level profiles for vLLM inference workloads on Trainium.

Additional Resources
--------------------

.. grid:: 1
   :gutter: 3

   .. grid-item-card:: Viewing Profiles with Perfetto
      :link: view-perfetto
      :link-type: doc
      :class-card: sd-border-1

      Learn how to view Neuron Explorer profiles using the Perfetto UI for trace analysis.

.. _download-neuron-explorer-vscode:

Neuron Explorer for Visual Studio Code
------------------------------------------------

The Neuron Explorer VSCode extension is available on the Visual Studio Code Extension Marketplace. To install the extension, open the Extensions view in VSCode by pressing **Ctrl+Shift+X** (Windows/Linux) or **CMD+Shift+X** (MacOS), and search for ``AWS Neuron Explorer`` or ``amazonwebservices.neuron-explorer``. Select the extension published by **Amazon Web Services** in the sidebar, then click the blue **Install** button.

You can also install the extension directly from the `Visual Studio Code Marketplace `_.

.. _neuron-explorer-faq:

Neuron Explorer FAQ
-------------------

What can I expect from Neuron Explorer?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Neuron Explorer provides a comprehensive profiling experience with both device-level and system-level profiling support. Neuron Explorer features an enhanced profiling experience with hierarchical profiling, bidirectional code linking, AI-powered recommendations, IDE integration, and more. In future releases, Neuron Explorer will continue to expand with additional profiling viewers and features, debugging capabilities, and enhanced recommendation and analysis tools to support the entire ML development journey on Trainium.

What is the difference between device-level and system-level profiling?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Device-level profiling captures hardware execution data from NeuronCores, including compute engine instructions, DMA operations, and hardware utilization. Use device-level profiling to analyze hardware performance, identify compute or memory bottlenecks, and optimize kernel implementations.

System-level profiling captures software execution data, including framework operations, Neuron Runtime API calls, CPU utilization, and memory usage. Use system-level profiling to analyze framework overhead, identify CPU bottlenecks, and debug runtime issues.

Is Neuron Explorer going to replace Neuron Profiler and Neuron Profiler 2.0?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes. Neuron Explorer is the recommended profiling tool and replaces both Neuron Profiler and Profiler 2.0. Neuron Profiler and Profiler 2.0 are supported for one final release. In the Neuron 2.29 release, they will enter end-of-support and will no longer receive updates or technical support, though they will remain accessible through the ``neuron-profile`` package in previous releases. Users should migrate to Neuron Explorer now.

Are my existing profiles compatible with Neuron Explorer?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes. Neuron Explorer is backwards compatible with profile data captured using Neuron Profiler or Profiler 2.0. Existing profile files must be reprocessed before viewing in Neuron Explorer, but you do not need to recapture them. See :ref:`new-neuron-profiler-setup`.
For detailed migration guidance, including CLI command mappings and feature comparisons, see the :doc:`migration-faq`.

.. toctree::
   :hidden:
   :maxdepth: 1

   Get Started
   Neuron Profiler to Neuron Explorer Migration Guide
   Capture and View Profiles
   Device Trace Viewer
   System Trace Viewer
   Hierarchy Viewer
   Source Code Viewer
   Summary Viewer
   Database Viewer
   Tensor Viewer
   Memory Viewer
   AI Recommendation Viewer
   View Profiles with Perfetto

================================================
FILE: tools/neuron-explorer/migration-faq.rst
================================================

.. _neuron-profiler-migration-guide:

Migration Guide from Neuron Profiler to Neuron Explorer
========================================================

This guide provides detailed information for migrating from Neuron Profiler or Neuron Profiler 2.0 to Neuron Explorer.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

Neuron Explorer is the recommended profiling tool for AWS Neuron workloads, replacing both Neuron Profiler and Neuron Profiler 2.0. This guide helps you transition your profiling workflows to Neuron Explorer.

Key Differences
---------------

The following table summarizes the key differences between Neuron Profiler/Profiler 2.0 and Neuron Explorer:

.. list-table::
   :widths: 30 35 35
   :header-rows: 1
   :align: left

   * - Feature
     - Neuron Profiler / Profiler 2.0
     - Neuron Explorer
   * - CLI tool
     - ``neuron-profile``
     - ``neuron-explorer``
   * - Device Profiling
     - Yes
     - Yes (enhanced)
   * - System Profiling
     - Yes (Profiler 2.0 only)
     - Yes
   * - Hierarchy Viewer
     - No
     - Yes
   * - Source Code Viewer
     - Yes (Device profiles)
     - Yes (Device profiles)
   * - AI Recommendation Viewer
     - No
     - Yes (for NKI profiles)
   * - IDE Integration
     - No
     - Yes (VSCode Extension)
   * - Database Viewer
     - No
     - Yes
   * - Tensor Viewer
     - No
     - Yes
   * - Additional Installation Requirements
     - InfluxDB installation required
     - None

Update CLI Commands
--------------------

Replace ``neuron-profile`` with ``neuron-explorer`` in your scripts and workflows. The following commands are subject to change before GA:

.. list-table::
   :widths: 50 50
   :header-rows: 1
   :align: left

   * - Neuron Profiler Command
     - Neuron Explorer Command
   * - ``neuron-profile view -d ./output``
     - ``neuron-explorer view -d ./output``
   * - ``neuron-profile view -n file.neff -s profile.ntff``
     - ``neuron-explorer view -n file.neff -s profile.ntff``
   * - ``neuron-profile capture -n file.neff -s profile.ntff``
     - ``neuron-explorer capture -n file.neff -s profile.ntff``

Frequently Asked Questions
--------------------------

Do I need to install InfluxDB for Neuron Explorer?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

No. Unlike Neuron Profiler, Neuron Explorer requires no external installation or setup.

How do I view existing profiles captured with Neuron Profiler?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Existing NEFF and NTFF files captured with Neuron Profiler are fully compatible with Neuron Explorer. To view them:

.. code-block:: bash

   # View a single device profile
   neuron-explorer view -n file.neff -s profile.ntff

The profiles will be reprocessed using Neuron Explorer's processing pipeline, which may provide additional insights not available in the original Neuron Profiler view.

How do I capture profiles with Neuron Explorer?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Neuron Explorer provides the ``neuron-explorer capture`` command for standalone NEFF profiling, similar to ``neuron-profile capture``:

.. code-block:: bash

   # Capture a device profile
   neuron-explorer capture -n file.neff -s profile.ntff

You can also use the framework profiling APIs or environment variables to capture profiles during your actual workload execution, as shown in the sketch below. For NKI kernel profiling, continue using the ``nki.benchmark`` or ``nki.profile`` APIs as documented in the :ref:`NKI profiling guide `.
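A minimal environment-variable sketch follows. It assumes the ``NEURON_RT_INSPECT_*`` variables referenced in the troubleshooting section of this documentation apply to your workload; ``your_workload.py`` is a hypothetical script name, and the full set of capture variables is described in the capture documentation:

.. code-block:: shell

   # Sketch: enable profile capture for a workload run via environment
   # variables. NEURON_RT_INSPECT_ENABLE and NEURON_RT_INSPECT_OUTPUT_DIR
   # are the variables referenced in the troubleshooting section of this
   # documentation; your_workload.py is a hypothetical script name.
   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./output
   python your_workload.py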
What new features does Neuron Explorer provide?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Neuron Explorer introduces several new capabilities:

- **Hierarchy Viewer**: Visualize execution from model layers down to hardware operations. See :doc:`overview-hierarchy-view`.
- **Source Code Viewer**: Navigate between source code and profile data. See :doc:`how-to-link-view-source-code`.
- **AI Recommendation Viewer**: Get AI-powered optimization suggestions for NKI profiles. See :doc:`overview-ai-recommendations`.
- **Database Viewer**: Run custom queries on profiling data. See :doc:`overview-database-viewer`.
- **Memory Viewer**: Get insight into memory allocation, usage patterns, and potential memory usage inefficiencies.
- **Tensor Viewer**: Examine tensor information including shapes and memory usage. See :doc:`overview-tensor-viewer`.
- **VSCode Extension**: View profiles directly in your IDE with native code linking support.
- **System Trace Viewer**: Enhanced system-level profiling visualization. See :doc:`overview-system-profiles`.

How do I get help during migration?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Review the :doc:`get-started` guide for initial setup
- See :doc:`how-to-profile-workload` for detailed capture and viewing instructions
- Check submitted issues and file new issues via the `AWS Neuron GitHub issues `_

================================================
FILE: tools/neuron-explorer/overview-ai-recommendations.rst
================================================

.. meta::
   :description: AI Recommendation feature helps identify and understand bottlenecks and optimization opportunities for NKI kernels through AI-powered analysis
   :date-modified: 11/21/2025

AI Recommendation Viewer
=========================

In this guide, you'll learn how to use the AI Recommendation Viewer to identify and understand bottlenecks and optimization opportunities for NKI kernels through AI-powered analysis of the user's profile and source code. Users receive actionable recommendations through the Neuron Explorer UI, CLI, or via their IDE. Each report provides the top 2-3 optimization opportunities ranked by effort and impact, including the symptom with quantified metrics, the optimization with implementation guidance, expected speedup estimates, and implementation tradeoffs. The feature is entirely opt-in and only enabled for profiles that the user explicitly requests a recommendation for.

.. warning::

   * Responses in this Amazon Bedrock-powered feature are AI-generated. Verify accuracy and appropriateness before use.
   * This feature is available in US Regions only. Neuron may securely transmit data across Regions within your geography for processing.
   * Your AWS account will be billed for Bedrock usage. Each time you generate an AI Recommendation for a profile, a single Bedrock request is made with up to 30,000 input tokens and 10,000 output tokens.
   * At the moment, this feature may only be used with Claude Sonnet 4.5.

.. _local_setup_directions:

Local setup directions
----------------------------------------------------

AI Recommendations use Amazon Bedrock. To enable this feature, you must configure AWS credentials on the system where you run ``neuron-explorer``. The AWS credentials should have ``bedrock:InvokeModel`` permissions and access to Claude Sonnet 4.5. For information on configuring Bedrock access, refer to the `AWS Bedrock model access documentation `_.
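For example, if you keep your Bedrock-enabled credentials in a named AWS profile, you can point ``neuron-explorer`` at them through the standard AWS environment variables. The profile name and Region below are illustrative; use a US Region where Claude Sonnet 4.5 access is enabled:

.. code-block:: shell

   # Illustrative: select a named AWS profile and a US Region with
   # Bedrock model access before running neuron-explorer.
   export AWS_PROFILE=bedrock-user
   export AWS_DEFAULT_REGION=us-west-2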
Getting an AI Recommendation From the UI
----------------------------------------------------

To generate an AI Recommendation from the UI, open your profile, click the "Add Widget" dropdown, and select **AI Recommendation**.

.. image:: /tools/profiler/images/recommendation-button.png

Go to the **AI Recommendation** widget box and click the **Get AI Recommendation** button. This performs additional analysis and sends the recommendation request to Amazon Bedrock, which can take up to a minute. Avoid refreshing the page during this time.

.. image:: /tools/profiler/images/recommendation-widget.png

Once the recommendation has been generated, it will be displayed in the widget box. For each recommendation you will see the performance inefficiency symptoms that were observed, the suggested optimization to make, and potential tradeoffs to look out for when implementing the optimizations.

.. image:: /tools/profiler/images/recommendation-view.png

Getting an AI Recommendation from the CLI
----------------------------------------------------

Users may also get AI recommendations with the ``neuron-explorer recommend`` CLI command.

Before you start, ensure that you have followed the :ref:`local setup directions <local_setup_directions>` to enable Bedrock access on your configured AWS account. ``neuron-explorer`` uses the default AWS credentials you have configured. If you want to use different credentials, you can specify an AWS profile by setting environment variables: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html.

To generate a recommendation, provide the following to the ``neuron-explorer recommend`` command:

* A NEFF file for your compiled NKI kernel
* An NTFF file for your captured profile
* The location where your NKI source files can be found

Example:

.. code-block:: shell

   neuron-explorer recommend -n <neff-file> -s <ntff-file> --nki-source-root <nki-source-dir>

Running this command processes the profile and prints the AI-generated recommendation to the console in Markdown format. You can save this output to a file and view it in any text editor or Markdown viewer.
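For instance, to keep the report for later review (the file names below are illustrative):

.. code-block:: shell

   # Illustrative file names: write the Markdown report to a file.
   neuron-explorer recommend -n kernel.neff -s profile.ntff \
       --nki-source-root ./src > recommendation.md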
================================================
FILE: tools/neuron-explorer/overview-database-viewer.rst
================================================

.. meta::
   :description: Learn about the Database Viewer tool in Neuron Explorer for querying and exploring profiling data using SQL or natural language queries.
   :date-modified: 01/27/2026

.. _database-viewer-overview:

Database Viewer
=====================

The Database Viewer offers an interactive interface providing visibility into all the underlying data that Neuron Explorer processes from a :doc:`NEFF ` and NTFF. Use this tool to develop your own analyses, examine profiling data stored in database tables, or run ad-hoc queries during performance analysis. You can access this data through natural language queries or raw SQL.

.. image:: /tools/profiler/images/database-viewer.png

Table Selection and Schema Inspection
-------------------------------------

When the tool loads, it fetches the list of available database tables. Select a table from the dropdown to view its schema. The schema table displays:

* **Field Name** - Column name (hover for description tooltip).
* **Data Type** - The data type of the field.
* **Required** - Whether the field is required.
* **Unit** - Measurement unit (if applicable).
* **Example** - Example value for the field.

Querying Data
-------------

The query input supports two modes:

1. **SQL queries** - Write standard SQL starting with ``SELECT``.
2. **Natural language queries** - Describe what you want in plain English.

Examples:

Natural language query to get the first 5 rows::

   Get the first 5 rows

SQL query to filter with conditions::

   SELECT field_name FROM table_name WHERE condition

Press **Enter** or click **Execute Query** to run. Use **Shift+Enter** for multi-line input.

Query Results
-------------

Results appear below the query input in reverse chronological order (newest first). Each result shows:

* The original query text.
* The generated SQL (for natural language queries).
* A scrollable results table.

Click **Export CSV** to download any result set as a CSV file.

.. image:: /tools/profiler/images/database-viewer-query-result.png

================================================
FILE: tools/neuron-explorer/overview-device-profiles.rst
================================================

.. meta::
   :description: Learn about Neuron Explorer widgets for device profiling including timeline views, event details, annotations, and performance analysis tools.
   :date-modified: 12/02/2025

Device Trace Viewer
===================

The Neuron Device Trace Viewer displays execution on a NeuronCore at hardware-instruction granularity. Neuron Explorer collects the timestamped start and end events that occur on the device into a NTFF. As a post-processing step, the profiler correlates these events with information in the compiled NEFF to generate a detailed report of the hardware performance. The Neuron Explorer UI provides several different tools for an extensible and customizable workflow.

.. image:: /tools/profiler/images/device-profile-1.png

Tools
------

Device Trace Viewer
~~~~~~~~~~~~~~~~~~~~~

The Device Trace Viewer presents a timeline view of the device execution, including activity on the DMA and compute engines, Hardware FLOPs Utilization (HFU) and device memory utilization over time, and more.

.. image:: /tools/profiler/images/device-profile-2.png

Hover
^^^^^

.. image:: /tools/profiler/images/device-profile-3.png

Hover over events in the timeline to see important identifying information at a glance, such as the time window, the hierarchy, and the hardware instruction that was executed. For more details, clicking the event will display the full details in the Event Details widget.

Color Scheme
^^^^^^^^^^^^

.. list-table::
   :header-rows: 0
   :widths: 50 50

   * - .. image:: /tools/profiler/images/device-profile-4.png
          :width: 100%
     - .. image:: /tools/profiler/images/device-profile-5.png
          :width: 100%

Instructions are color-coded according to their associated PyTorch operator. All instructions derived from the same PyTorch operator share an identical color.

.. note::

   In future releases, we will introduce more customizable options for color-coding.

Panning
^^^^^^^

.. image:: /tools/profiler/images/device-profile-6.gif

Panning is supported in a couple of ways:

* Left-clicking the x-axis and dragging it
* Spinning the scroll wheel while holding down Shift
* With the keyboard:

  * A/D keys for left/right movement
  * Left/right arrow keys for left/right movement

The amount panned depends on the current zoom level.

Event Details
~~~~~~~~~~~~~

Upon clicking an event in the Device Trace Viewer, all details related to the event will appear in the Event Details.
The information shown is a superset of the information available on hover, allowing you to dive deeper into what is happening on the hardware.

* The Event Details table will populate with field data from clicked events from the instruction widget.
* When filtering by fields through Search, all matching events will be rendered as pages in the Event Details. Users can navigate through each page to analyze data for each matching event.

.. image:: /tools/profiler/images/device-profile-7.png

Annotations
~~~~~~~~~~~

Users can create annotations by right-clicking in the Device Trace Viewer. These annotations can be moved by clicking and dragging the vertical line, and will snap to the closest events when applicable. The annotations tab shows more details on all available annotations in the profile, such as the time difference and summary metrics between two markers. The choice of which two annotations to compare is configurable in the "diff vs" column. You can also quickly zoom in to the region between two annotations by selecting the checkbox on the left. Users can rename, delete, save, and load annotations for better readability and collaboration.

.. image:: /tools/profiler/images/device-profile-8.png

Operator Table
~~~~~~~~~~~~~~

The Operator Table aggregates the hardware-level metrics into framework layers and operations, such as the MFU and the amount of data being moved. Users can progressively expand each row to get a further breakdown of each nested operator. Filters can be applied and columns can be sorted for more streamlined viewing.

.. image:: /tools/profiler/images/device-profile-9.png

Overall Summary
~~~~~~~~~~~~~~~

The Overall Summary displays performance metrics across the entire profile run, with metrics broken down into different categories such as by the NeuronCore engines. These can be used for quick insights into how well the model performed.

.. image:: /tools/profiler/images/device-profile-10.png

Current Selection Summary
~~~~~~~~~~~~~~~~~~~~~~~~~

The Current Selection Summary provides metrics for the current time window. Zooming in and out in the Device Trace Viewer will update the summary. This can be used in conjunction with the zoom feature of Annotations for easy access to a region of interest.

.. image:: /tools/profiler/images/device-profile-11.png

.. _box-selection-summary:

Box Selection Summary
~~~~~~~~~~~~~~~~~~~~~

The Box Selection Summary provides metrics within a bounding box region. Select and drag regions within the timeline widget to update the summary.

.. image:: /tools/profiler/images/box-select.gif

To use box selection, toggle the box selection button within the timeline widget, then select and drag a region; clear the selection with the ``esc`` key. Corresponding summary information for the selected region is displayed within the box selection widget.

Code Viewer
~~~~~~~~~~~

Profiles that are uploaded with source code files enable users to quickly navigate between NKI and application-level source code and the corresponding hardware-level instructions. In the Device Trace Viewer, we can click on an event to highlight the source code line in the Code Viewer. A (Ctrl/Cmd) + click on the event will scroll to the corresponding source code line. In the Code Viewer, clicking on a line in the source code will automatically highlight all associated events in the Device Trace Viewer. Similarly, highlighting multiple lines of the source code will also highlight all events in the timeline.
.. image:: /tools/profiler/images/device-profile-12.png

See :ref:`neuron-explorer-source-code` for instructions on how to enable source code viewing.

Layout Customization
~~~~~~~~~~~~~~~~~~~~

Understanding and optimizing performance with the profiler can be overwhelming given the amount of information being processed and displayed. As part of preparing for optimization work, you can cross-reference different information, such as the Device Trace Viewer with the application source code. With the widget-based UI, you can customize the layout to best fit a specific workflow. Each widget can be added, removed, dragged around, and resized. Once you are happy with the layout, you can save it through the Layout dropdown at the top right. The layouts are not tied to a specific profile, so they can be loaded and re-used for future profiles as well.

.. image:: /tools/profiler/images/device-profile-13.png

================================================
FILE: tools/neuron-explorer/overview-hierarchy-view.rst
================================================

.. meta::
   :description: Learn about the Hierarchy View in Neuron Explorer for analyzing framework layers and HLO operations with zooming, highlighting, and display options.
   :date-modified: 12/02/2025

Hierarchy Viewer
===================

The Hierarchy Viewer shows an up-leveled representation of the hardware execution organized by the framework layers and HLO operations. It enables you to progressively drill down into nested layers or operators and map the execution of application-level constructs to the Neuron device. This view interacts with other tools such as the Device Trace Viewer.

.. image:: /tools/profiler/images/hierarchy-view-1.gif

Zooming
-------

.. image:: /tools/profiler/images/hierarchy-view-2.png

You can zoom in on the Hierarchy Viewer in a couple of ways:

* Click-drag your mouse across the graph (supported in both directions)
* Scroll down using your mouse wheel, with the mouse cursor on the x-axis
* Zoom in and out buttons in the top-right corner
* With the keyboard:

  * W and S for zooming in and out, respectively
  * Up and down arrow keys for zooming in and out, respectively

To zoom out, simply scroll up with your mouse wheel while the mouse cursor is on the x-axis.

Change Displayed Layers
-----------------------

.. image:: /tools/profiler/images/hierarchy-view-3.png

The display options menu, accessed with the button in the top-right corner, allows you to selectively show or hide different layers. For instance, in the example shown above, the framework layer is hidden while displaying the hierarchy starting from HLO.

Highlighting
------------

.. image:: /tools/profiler/images/hierarchy-view-4.png

Right-clicking on an operator in the Hierarchy Viewer will highlight all the corresponding instructions in the Device Trace Viewer for the operator using the same color. Multiple operators can be highlighted at once.

.. image:: /tools/profiler/images/hierarchy-view-5.png

================================================
FILE: tools/neuron-explorer/overview-memory-viewer.rst
================================================

.. meta::
   :description: Learn about the Memory View in Neuron Explorer for analyzing all the memory allocations on SBUF.
   :date-modified: 03/24/2026

Memory Viewer
===================

The Memory Viewer in Neuron Explorer offers deep, low-level insight into memory allocation, usage patterns, and potential inefficiencies — going well beyond surface-level metrics.
With comprehensive visibility into how memory is consumed across the device, it enables kernel and performance engineers to make informed optimization decisions, reduce debugging time, and improve overall system performance.

.. image:: /tools/neuron-explorer/images/memory_viewer_overview.png
   :alt: Memory Viewer overview showing memory allocation patterns across SBUF partitions

Enable Memory Viewer during Profile Upload
--------------------------------------------

To enable the Memory Viewer feature, check the option 'Enable Memory Viewer' when you upload your profile:

.. image:: /tools/neuron-explorer/images/memory_viewer_enable.png

View the Memory Viewer Widget
------------------------------

Once your profile finishes processing and is ready to view, click the Add Widget button and select 'Memory Viewer':

.. image:: /tools/neuron-explorer/images/memory_viewer_add_widget.png

By hovering your mouse over each allocation, you can see detailed information about that allocation. For allocations triggered by instructions, the hover information includes:

* Start time and end time
* Duration
* Start address and end address
* Opcode
* Operands

For allocations triggered by DMAs, the hover information includes:

* Partition number
* Start time and end time
* Duration
* Start address and end address
* DMA queue name
* Block ID

By analyzing memory allocations, you can address memory fragmentation by identifying sparse allocation patterns and potentially rescheduling instructions or DMAs to different addresses to maintain memory compactness. Additionally, you can perform spill/reload analysis to identify opportunities for reducing spills by relocating allocations to available space at alternative addresses.

You can also use the dropdown menu to inspect the memory allocations on different partitions and NeuronCores:

.. image:: /tools/neuron-explorer/images/memory_viewer_hover.png

================================================
FILE: tools/neuron-explorer/overview-summary-page.rst
================================================

.. meta::
   :description: Learn how to use the Neuron Explorer summary page to quickly identify performance issues, view key metrics, and get actionable optimization recommendations for your profiles.
   :date-modified: 03/20/2026

Summary Viewer
================

The Neuron Explorer summary viewer provides a streamlined view of your profile's most critical performance insights, enabling quick identification of issues and optimization opportunities without navigating through detailed data.

.. image:: /tools/profiler/images/explorer-summary-page.png

Benefits
--------

Both new and experienced users benefit from this streamlined view of profiling data.

* Identify performance issues quickly
* Understand your profile's most critical metrics at a glance
* Get actionable recommendations for optimization

How to use
-------------

1. **Open your profile** - The Summary Viewer is accessible via the Profile Manager or Neuron Explorer UI.
2. **Examine key metrics** - Review the metrics and graphs to understand your profile's performance characteristics.
3. **Review recommendations** - Start with the **Performance Insights & Recommendations** section. This section highlights the most important performance issues.
4. **Select specific time regions** - Use the "Region Selection" menu to view specific timeslices corresponding to network layers. This helps you drill down into specific sections of your profile. You can generate custom time regions using the "Add Region" button.
5. **Take action** - Apply the recommended optimizations to your model or workload.

Understanding region-level insights
-----------------------------------

When you work with profiles from entire networks or network subgraphs, different regions will have different performance characteristics. The landing page enables performance analysis on a per-layer basis and provides:

* Layer-specific recommendations
* Time-range indication of where problems occur
* More accurate insights for complex profiles

Use the 'Region Selection' menu to navigate between different layers and view their individual performance data.

What the landing page displays
------------------------------

Performance Insights and Recommendations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section shows 2-4 recommendations to help you improve performance. The profiler analyzes your data, identifies the most important issues to address, prioritizes them by criticality, and shows you the most critical ones first.

Example recommendations
^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Condition
     - Root Cause
     - Recommended Action
   * - Low Model FLOPS relative to Active FLOPS (< 50%)
     - Tensor engine is active but not performing useful matrix operations
     - Ensure instructions use the entire tensor engine and are pipelined correctly
   * - NKI instruction coverage < 50% on tensor, vector, or scalar engine
     - Compiler-generated instructions dominate the engine
     - Write NKI kernel code for the network operations present in that profile section
   * - Active FLOPS throttling detected
     - FLOPS lost due to throttling during active tensor engine periods
     - Investigate the root cause of throttling to recover tensor engine utilization
   * - Transpose FLOPS > 10% of total hardware FLOPS
     - Excessive data movement within the tensor engine
     - Improve memory layout to reduce transpose operations
   * - Collective operation outliers detected
     - Significantly underperforming collective operations relative to their group median
     - Check for overlapping instructions that might be causing delays
   * - Spill reload bytes > 25% of total HBM reads
     - Excessive spill/reload operations consuming memory bandwidth
     - Check for data dependencies causing excessive spill/reload operations

Key Metrics
~~~~~~~~~~~

This section displays tables and graphs that summarize your profile's performance metrics.

Compute Performance Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **total_time** - Total duration of on-device time for the run in seconds. This doesn't include host-device data movement overhead or host runtime/framework overhead.
* **mm_arithmetic_intensity** - The ratio of regular Matrix Multiplication (MATMUL) Floating Point Operations (FLOPs) to total Dynamic Random Access Memory (DRAM) transfer size. This metric helps you determine if your workload is memory-bound or compute-bound.
* **hfu_estimated_percent** - Hardware FLOPs Utilization reflects the Tensor Engine utilization calculated from all Tensor Engine instructions.
* **mfu_estimated_percent** - Model FLOPs Utilization reflects the Tensor Engine utilization for useful compute (matrix multiplications from your model definition).

Memory Bandwidth Utilization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **total_bandwidth_available** - The total bytes possible to be transferred within the given time region for the current Neuron hardware specification.
* **mbu_estimated_percent** - Memory Bandwidth Utilization (MBU) shows the achieved (as running on the current Neuron hardware) High Bandwidth Memory (HBM) bandwidth utilization.
* **average_dma_size** - The average DMA transfer size (higher is better).
* **useful_read_percent** - The fraction of HBM reads that are useful: (``hbm_read_bytes`` - ``hbm_reload_bytes``) / ``hbm_read_bytes``. Note that "useful" is not an inherent property of the memory itself, but a measurement of how efficiently the memory is being utilized by a specific workload or application. Low numbers may indicate inefficient memory access patterns and suboptimal layouts.

FLOPs Utilization
^^^^^^^^^^^^^^^^^

For each compute engine (tensor, vector, scalar, gpsimd), displays how well utilized the engine is. You can view all cores simultaneously or select a specific Neuron Core from the dropdown.

Tensor Engine
"""""""""""""

The Tensor engine has a detailed breakdown of how the FLOPs are being used:

* **model_flops** - The percentage of tensor FLOPs spent performing useful matrix operations, contributing to model progress.
* **transpose_flops** - The percentage of tensor FLOPs spent performing transpose operations / data movement.
* **active_flops** - The percentage of tensor FLOPs that correspond to the active time of the tensor engine, but where the engine was not effectively utilized.
* **throttled_flops (active and inactive)** - The percentage of FLOPs wasted due to throttling, either during active or inactive tensor engine periods.

There are a few key things to look for in this graph:

1. **model_flops relative to active_flops**. Large differences could indicate that the tensor engine is being poorly utilized with small tensor sizes, or that operations are not being pipelined effectively.
2. **model_flops relative to transpose_flops**. It is desirable to have little-to-no ``transpose_flops`` consuming tensor engine utilization. Ideally the ``model_flops`` amount is much larger than the amount of transposes.
3. **active_throttled_flops**. Losing FLOPs to throttling during active periods is undesirable. It is worth identifying the root cause of the throttling if there is indication of this happening.

Other Engines (Scalar, Vector, GpSimd)
"""""""""""""""""""""""""""""""""""""""

These engines do not yet have detailed FLOP utilization breakdowns; they only show the active period of operation for the engine.

* **active_flops** - Percentage of FLOPs when the engine processes at least one instruction (excluding semaphore waits).

NKI Engine Statistics
^^^^^^^^^^^^^^^^^^^^^

This chart shows the instruction count breakdown between NKI-generated instructions and compiler-generated instructions for each compute engine (tensor, vector, scalar). The stacked bar chart helps you understand how much of your workload is running NKI kernel code versus compiler-generated code. Hovering over a bar displays a detailed breakdown of instruction counts by opcode for that engine and source type. When NKI instruction coverage is below 50% for a given engine, the summary page generates a recommendation to write NKI kernel code for the network operations in that profile section.

DMA Utilization
^^^^^^^^^^^^^^^

This chart shows how the DMA engines are being utilized, displayed as a percentage of the total available bandwidth. Two dropdown menus control the chart's aggregation:

* **Outer aggregation** - Choose between viewing data per DMA engine ("All Engines") or per Neuron Core ("Neuron Cores").
* **Inner aggregation** - Choose between grouping by data type or source type:

  * **Data Type** groups transfers into Instruction, IO, Weights, and Dynamic categories.
  * **Source Type** groups transfers into Static (compiler-generated), Software Dynamic (GpSimd-generated), and Hardware Dynamic (DGE hardware-generated) categories.

Each category shows two bar segments: a solid bar representing bandwidth utilization and a striped bar representing active time utilization beyond the bandwidth portion. This helps distinguish between time spent transferring data and time the DMA engine is active but not fully utilizing bandwidth.

Memory Bandwidth Breakdown
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Shows how the available HBM memory bandwidth was used as a doughnut chart:

* HBM Read — effective read bytes (excluding spill reloads)
* HBM Write — effective write bytes (excluding spill saves)
* SBUF Spill Reload — bytes reloaded from HBM due to state buffer spills
* SBUF Spill Save — bytes saved to HBM due to state buffer spills
* Unused — remaining available bandwidth

Collective Operations Duration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Displays the duration of each collective operation in the profile, grouped by operation type and size. Two visualization modes are available via a dropdown:

* **Scatter** - Shows individual operation durations as scatter points, with each operation type on a separate row. Hovering over a point displays detailed information including algorithm, operation, duration, start/end timestamps, element count, input/output sizes, and trigger engine. Clicking a point pins the tooltip for easy text selection.
* **Box Plot** - Shows the statistical distribution (min, Q1, median, mean, Q3, max, variance, count) of operation durations per operation type. This is useful for quickly identifying the spread and central tendency of each operation group.

Both modes are useful for identifying outliers in collective runtime, which can be used to investigate specific sections of the profile more deeply. You can filter out a dataset by clicking its entry in the graph legend.

System Information
^^^^^^^^^^^^^^^^^^

Displays metadata about the system and software versions used during profiling:

* Instance Type
* Compiler Version
* Explorer Version
* Driver Version
* Runtime Version
* Collectives Version

System Profile Summary
======================

When a system profile is loaded, the Summary Viewer automatically switches to the System Profile Summary view. System profiles capture data across multiple devices, processes, and instances, providing a holistic view of distributed workload performance.

Overview
--------

The System Profile Summary provides:

* A high-level overview of the entire system's profiling session
* HBM memory usage trends across logical NeuronCores
* A detailed table of all device profiles with key performance metrics
* The ability to drill down into individual device profiles for detailed analysis

System Overview Card
--------------------

Displays aggregate information about the profiling session:

* **Instances** - Number of unique instances captured in the profile
* **Processes** - Number of unique processes captured
* **System Profile Time** - Total wall-clock duration of the system profiling session
* **Total Device Runtime** - Cumulative on-device execution time across all device profiles
* **Total Device Profiles** - Number of individual device profiles in the system profile

HBM Memory Usage Chart
-----------------------

A line chart showing HBM memory usage over time.
When per-NeuronCore data is available, the chart displays a separate line for each logical NeuronCore (HBM index), color-coded for easy identification. When only aggregate data is available, a single filled area chart shows total HBM usage. The x-axis shows time (in the profiling session's time domain) and the y-axis shows memory usage in bytes. Hovering over the chart displays the exact timestamp and memory usage for each NeuronCore.

Device Profiles Table
---------------------

A table listing all device profiles captured in the system profile. The table supports:

* **Process filtering** - Use the dropdown to filter profiles by process ID, or select "All Processes" to view everything.
* **Expandable rows** - Click the expand arrow on any row to see additional per-profile metrics including tensor/vector/scalar engine active time percentages, DMA active time, and HBM read/write bytes.
* **Column tooltips** - Hover over column headers to see descriptions of each metric from the profile schema.

Table columns:

* **Profile Name** - Clickable link that navigates to the detailed device profile view
* **LNC** - Logical NeuronCore ID
* **Neuron Cores** - Number of physical NeuronCores used by this profile
* **Total Duration** - Total on-device execution time for this profile's events
* **Calls** - Number of execution events for this profile
* **Duration** - Total profiled time for this device profile
* **MFU** - Model FLOPs Utilization
* **HFU** - Hardware FLOPs Utilization
* **MBU** - Memory Bandwidth Utilization
* **CC Active** - Collective communication active time percentage

Device Profile Detail View
--------------------------

Clicking a device profile name in the table navigates to a detail view that embeds the standard Summary Viewer for that specific device profile. This provides the full set of per-device metrics, charts, and recommendations described in the sections above. A "Back to System Overview" button at the top returns you to the system-level summary.

================================================
FILE: tools/neuron-explorer/overview-system-profiles.rst
================================================

.. meta::
   :description: Learn about the System Profile in Neuron Explorer for analyzing system-level execution across instances and workers with runtime and hardware events.
   :date-modified: 01/30/2026

System Profile
================

The Neuron System Profile shows a system-level view of execution across instances and workers in your workload. This provides visibility into Neuron Runtime API calls and ML framework function calls (PyTorch or JAX) to help identify bottlenecks in distributed workloads. The Neuron Explorer UI provides system-level widgets for an extensible and customizable workflow.

.. image:: /tools/neuron-explorer/images/neuron-explorer-system-viewer.png

System Trace Viewer
---------------------

The System Trace Viewer provides an interactive timeline interface with time range selection, configurable event grouping, system event details on hover, and linking of hardware events to Device Trace Viewer widgets. You can see events in the Neuron Runtime and correlate them with hardware execution events on the Neuron Devices.

.. image:: /tools/neuron-explorer/images/system-timeline-widget.png

You can also see the device memory (HBM) allocations for each Neuron device over time. Hovering over these memory usage events shows a breakdown by usage category.
.. image:: /tools/neuron-explorer/images/system-timeline-widget-hbm-usage.png

Adding Widgets
---------------

The System Profile supports both System and Device widgets, enabling multi-profile analysis, for example comparing annotated device events across different devices. To add a widget:

1. Click the **Add Widget** button to open the Add Widget modal.
2. Select a Device or System widget.
3. Click a widget tile to load it with the selected profile. Each tile is tagged with its supported profile type (system, device, or both).

To load multiple instances of the same widget type for different profiles, repeat the steps above and select a different profile each time.

.. image:: /tools/neuron-explorer/images/system-timeline-add-widget.gif

After adding a widget, you can switch to a different profile by using the profile dropdown at the top of the widget.

.. image:: /tools/neuron-explorer/images/widget_switch_profiles.png

.. note::

   Adding duplicate widgets for the same profile is not currently supported.

Settings
----------

The System Trace Viewer supports multiple grouping modes to organize events for different analysis perspectives. You can switch between the following grouping modes in the settings to focus your analysis on different aspects of system performance:

.. list-table:: Grouping Options
   :widths: auto
   :header-rows: 1
   :align: left

   * - Grouping Option
     - Description
     - Example
   * - CPU vs Device Grouping (Default)
     - Groups events by event source (CPU or Neuron device events)
     - Runtime events: ``i-0b1ea78ca2865fd32/PID:1765325/TID:0/neuron_rt``, Hardware events: ``i-0b1ea78ca2865fd32/PID:1765325/Worker:0/neuron_hw``
   * - NeuronCore Grouping
     - Groups events by individual NeuronCore
     - ``i-0b1ea78ca2865fd32/NC:0``, ``i-0b1ea78ca2865fd32/NC:1``
   * - Thread Grouping
     - Groups events by thread identifier
     - ``i-0b1ea78ca2865fd32/PID:1765325/TID:0``
   * - Process Grouping
     - Groups events by process identifier
     - ``i-0b1ea78ca2865fd32/PID:1765325``
   * - Instance Grouping
     - Groups all events by instance only
     - ``i-0b1ea78ca2865fd32``

.. image:: /tools/neuron-explorer/images/system-timeline-settings.png

Event Details
--------------

Clicking on trace events in the timeline populates the Event Details widget with a list of properties for the system trace event.

.. image:: /tools/neuron-explorer/images/system-event-details.png

Device Profile Linking
------------------------

The System Trace Viewer links hardware events to the Device Trace Viewer, which renders the corresponding device traces. Navigating from the System Trace Viewer to a Device Trace Viewer can be accomplished in two ways:

Open the Device Profile List Modal
------------------------------------

To see a list of all device profiles captured during your workload:

1. **Click the "Device Profiles List" button** in the top right action bar of the System Trace Viewer to open a modal containing a list of device profiles
2. **Select a Device Profile and click Submit** to open the Device Trace Viewer with the selected device profile

.. image:: /tools/neuron-explorer/images/system-timeline-device-profiles-list-modal.png

Drill-down from Hardware Events
---------------------------------

To drill down from a hardware event to the Device Trace Viewer:

1. Find a hardware event such as ``nc_exec_running``
2. Click on the hardware event
3. Wait for the Device Trace Viewer to open

This will open a new Device Trace Viewer with the selected device profile showing detailed hardware events. To learn about device profiles, see :doc:`Device Profiles in Neuron Explorer `.
.. image:: /tools/neuron-explorer/images/system-timeline-hardware-event-linking.gif

================================================
FILE: tools/neuron-explorer/overview-tensor-viewer.rst
================================================

.. meta::
   :description: Learn about the Tensor Viewer in Neuron Explorer for viewing tensor information including names, sizes, shapes, and memory usage details.
   :date-modified: 01/27/2026

.. _tensor-viewer-overview:

Tensor Viewer
=================

The Tensor Viewer contains the following information about all tensors in the NEFF file:

* **variable_name** - The tensor name.
* **type** - How the system uses the tensor. Examples include input tensor, output tensor, or weight tensor.
* **format** - How the tensor is arranged in memory. For example, "NHWC" shows a specific dimension arrangement. Letters include N (batch size), H (height), W (width), C (channel).
* **shape** - The tensor's multi-dimensional shape.
* **size** - The tensor's total size in bytes.
* **node** - NEFF node.
* **pcore_idx** - Index of the physical NeuronCore within a Logical NeuronCore (LNC). A Logical NeuronCore groups physical NeuronCores. For LNC2, this field shows either 0 or 1.
* **load_to_sbuf_avg_size_bytes** - The average size in bytes of each DMA transfer when the system loads this tensor into the State Buffer.
* **load_to_sbuf_total_size_bytes** - The total size in bytes of all DMA transfers when the system loads this tensor into the State Buffer.
* **load_to_sbuf_dma_count** - The total number of DMAs that loaded this tensor into the State Buffer.
* **load_to_sbuf_repeat_factor** - How many times the system loaded this tensor into the State Buffer. A value of 1 means one load, 2 means two loads, and so on.

.. image:: /tools/profiler/images/tensor-viewer-table.png

You can use this data to match with framework-level instructions or for kernel development. You can also use it to search for instructions in the Device Timeline Viewer. The SBUF loading information in the table can help you verify that tensors are loaded efficiently.

Searching
---------

You can use the Tensor Viewer with the Device Timeline Viewer and Search tool to match tensor information in the table with instructions that run on the device. Enter the variable_name from the table into the DMA search field to see all DMA instructions that relate to that tensor. The example below shows a complete search for the tensor ``token_position_to_id``:

.. image:: /tools/profiler/images/tensor-viewer-search-example.png

================================================
FILE: tools/neuron-explorer/view-perfetto.rst
================================================

.. meta::
   :description: Learn about using Neuron Explorer with Perfetto
   :date-modified: 02/05/2026

Viewing Profiles with Perfetto
==============================

.. note::

   New Neuron Explorer features released in 2.27 and onwards may not be supported in Perfetto. For the full user experience and feature set, please use the Neuron Explorer UI or VSCode Integration.

Perfetto is an open-source trace analysis toolkit with a powerful UI for visualizing and analyzing trace data. Users of Neuron Explorer have the option of viewing their profiles in the Perfetto UI. The ``--output-format perfetto`` option writes processed data to Perfetto's native protobuf-based tracing format, which can be visualized in the Perfetto UI at https://ui.perfetto.dev/.

Example:
.. code-block:: shell

   neuron-explorer view -d ./output --output-format perfetto

This will generate a ``system_profile.pftrace`` file for the system profile and a ``device_profile_model_<model>.pftrace`` file for each unique compiled model that was executed on a Neuron Device.

To view the system profile, go to https://ui.perfetto.dev/ and open the ``system_profile.pftrace`` file.

.. note::

   When loading trace files in the Perfetto UI, your data is processed locally and not uploaded to Perfetto's servers.

|neuron-explorer-perfetto-timeline|

To view a device profile, go to https://ui.perfetto.dev/ and open the ``device_profile_model_<model>.pftrace`` file. This will show a detailed view of hardware activity on the NeuronCore during execution of this graph.

|neuron-explorer-perfetto-device-timeline|

.. note::

   Your browser may run out of memory when viewing ``*.pftrace`` (Perfetto trace) files that are more than a few hundred MB. See the section :ref:`Viewing Large Profiles in Perfetto ` for directions on how to view large traces using the trace processor.

Perfetto Output View Options
----------------------------

When outputting to Perfetto, it is possible to group your traces by different attributes. This is useful for larger profiles involving many NeuronCores and instances. The following options are available:

.. list-table:: Perfetto output view options
   :header-rows: 1
   :widths: 30 70

   * - CLI option
     - Description
   * - ``--system-trace-primary-group``
     - First-order grouping of trace events (maps to a Perfetto process / process group of rows). Provide a comma-delimited list of field names. Allowed fields: ``instance_id``, ``thread_id``, ``lnc_idx``, ``process_id``. Default: ``instance_id,process_id``.
   * - ``--system-trace-secondary-group``
     - Second-order grouping of trace events (maps to a Perfetto thread / single row). Provide a comma-delimited list of field names. Allowed fields: ``instance_id``, ``worker_gid``, ``thread_id``, ``lnc_idx``, ``process_id``. Default: ``worker_gid,lnc_idx,thread_id``.

For example, the following profile uses ``neuron-explorer view --output-format=perfetto --system-trace-primary-group=instance_id,process_id --system-trace-secondary-group=lnc_idx,thread_id`` to group the system profile first by unique combinations of instance_id and process_id; within each of those groups there are rows of events with unique combinations of lnc_idx and thread_id.

|neuron-explorer-perfetto-grouping|

Grouping By Global Worker ID
----------------------------

By default, Perfetto traces are grouped by ``worker_gid``, which is a unique global identifier for each NeuronCore across all instances in a distributed workload. When clicking on an event in the trace you will see fields for both ``lnc_idx`` (local NeuronCore index on that process) and ``worker_gid`` (global NeuronCore index across all instances). It is possible for ``lnc_idx`` to be the same for different processes on the same instance or across different instances in a distributed workload. However, ``worker_gid`` is unique for each NeuronCore across all instances. The image below shows how to correlate the naming of tracks (rows) in the Perfetto UI to both ``lnc_idx`` and ``worker_gid``.

|neuron-explorer-perfetto-gid|

.. |neuron-explorer-perfetto-timeline| image:: /images/neuron-profiler2-perfetto-timeline.png
.. |neuron-explorer-perfetto-device-timeline| image:: /images/neuron-profiler2-perfetto-device-timeline.png
.. |neuron-explorer-perfetto-grouping| image:: /images/neuron-profiler2-perfetto-grouping.png
.. |neuron-explorer-perfetto-gid| image:: /images/neuron-profiler2-perfetto-gid.png

================================================
FILE: tools/neuron-sys-tools/index.rst
================================================

System Tools
============

Neuron system tools provide essential utilities for monitoring, debugging, and managing AWS Neuron devices and workloads. These command-line tools offer real-time insights into device utilization, process management, hardware health, and performance metrics across Neuron instances.

.. toctree::
   :maxdepth: 1
   :hidden:

   Neuron-Monitor User Guide
   Neuron-Top User Guide
   Neuron-LS User Guide
   Neuron-Sysfs User Guide
   NCCOM-TEST User Guide
   TensorBoard

.. grid:: 1 1 2 2
   :gutter: 3

   .. grid-item-card:: Neuron-Monitor User Guide
      :link: /tools/neuron-sys-tools/neuron-monitor-user-guide
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Real-time monitoring tool for tracking NeuronCore utilization, memory usage, and thermal metrics across Neuron devices with customizable output formats.

   .. grid-item-card:: Neuron-Top User Guide
      :link: /tools/neuron-sys-tools/neuron-top-user-guide
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Interactive process viewer similar to htop that displays running processes on Neuron devices with real-time resource consumption metrics.

   .. grid-item-card:: Neuron-LS User Guide
      :link: /tools/neuron-sys-tools/neuron-ls
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Device discovery and listing tool that provides detailed information about available Neuron devices, their capabilities, and current status.

   .. grid-item-card:: Neuron-Sysfs User Guide
      :link: /tools/neuron-sys-tools/neuron-sysfs-user-guide
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Low-level system interface tool for accessing Neuron device information through the Linux sysfs filesystem interface.

   .. grid-item-card:: NCCOM-TEST User Guide
      :link: /tools/neuron-sys-tools/nccom-test
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      Collective communication testing and benchmarking tool for validating and measuring performance of multi-device communication patterns.

   .. grid-item-card:: TensorBoard
      :link: /tools/tensorboard/index
      :link-type: doc
      :class-header: sd-bg-primary sd-text-white

      TensorBoard Neuron plugin for Trn1 instances, including installation, configuration, and advanced visualization features.

   .. grid-item-card:: Tutorials
      :link: /tools/tutorials/index
      :link-type: doc
      :class-header: sd-bg-secondary sd-text-white

      Tutorials on how to use the Neuron system tools suite.

   .. grid-item-card:: What's New
      :link: /release-notes/prev/2.27.0/index
      :link-type: doc
      :class-header: sd-bg-secondary sd-text-white

      Latest updates, new features, and improvements to the Neuron system tools suite.

================================================
FILE: tools/neuron-sys-tools/nccom-test.rst
================================================

.. _nccom-test:

======================
NCCOM-TEST User Guide
======================

.. contents:: Table of contents
   :local:
   :depth: 2

Overview
--------

**nccom-test** is a benchmarking tool for evaluating Collective Communication operations on AWS Trainium and Inferentia instances. It supports Trn1, Trn2, Trn3, and Inf2 instance types. The tool can assess performance across multiple instances or perform quick environment sanity checks before running more complex workloads.
While single-instance benchmarking is supported for all compatible instance types, multi-instance benchmarking is limited to Trainium instances (Trn1, Trn2, and Trn3). To execute collective operations, **nccom-test** will generate, and then execute, NEFFs (Neuron Executable File Format) containing several collective operation instructions. .. note:: On Inf2 instances, only single-instance benchmarking is supported. Running a multi-node nccom-test benchmark will result in an error. Using nccom-test ---------------- Here is a simple example which runs a 2-worker (2 ranks) all-reduce with a total size of 32MB: .. code-block:: nccom-test -r 2 allr size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 33554432 33554432 uint8 768 40.69 40.69 Avg bus bandwidth: 40.6901GB/s Output description ^^^^^^^^^^^^^^^^^^ The command will output a table with several columns of performance metrics. There will be a line for every requested data size (by default the data size is 32MB, as seen in the previous example). .. list-table:: :widths: 40 260 :header-rows: 1 * - Column name - Description * - size(B) - Size in bytes for the data involved in this collective operation * - count(elems) - Number of elements in the data involved in this collective operation. For example, if **size(B)** is 4 and **type** is fp32, then **count** will be 1 since one single fp32 element has been processed. * - type - Data type for the processed data. Can be: **uint8**, **int8**, **uint16**, **int16**, **fp16**, **bf16**, **int32**, **uint32**, **fp32** * - time(us) - Time in microseconds representing the average of all durations for the Collective Communication operations executed during the benchmark. * - algbw(GB/s) - Algorithm bandwidth in gibibytes (1GiB = 1,073,741,824 bytes) per second, which is calculated as **size(B)** / **time(us)** * - busbw(GB/s) - Bus bandwidth - bandwidth per data line in gibibytes per second - it provides a bandwidth number that is independent of the number of ranks (unlike **algbw**). For a more in-depth explanation of bus bandwidth, please refer to `Bus Bandwidth Calculation`_ * - algorithm (optional) - Algorithm used to execute this collective operation (e.g. Ring, Mesh, RDH) * - Avg bus bandwidth - Average of the values in the busbw column .. _Bus Bandwidth Calculation: **Bus Bandwidth Calculation:** The purpose of bus bandwidth is to provide a number reflecting how optimally hardware is used, normalizing for different rank counts. Given the following: - ``r`` as the number of ranks participating in a collective operation - ``s`` as the size of the collective operation - ``B`` as the bus bandwidth of a single rank - ``t`` as the latency of the operation Let's take an AllGather operation as an example. To complete an AllGather operation with ``r`` ranks, each rank must transfer ``r-1`` data chunks of size ``s/r``. Therefore, with a bandwidth of ``B``, the latency (``t``) of the operation would be: .. code-block:: t = ((number of chunks to transfer) * (size of each chunk)) / (bandwidth of rank) t = ((r-1) * (s/r)) / B However, for a given collective operation result, we have the latency, but not the bandwidth of each rank. Rearranging to solve for bus bandwidth, we get: .. code-block:: B = ((r-1) * (s/r)) / t which, given ``algbw = s / t``, can also be rewritten as: .. code-block:: B = ((r-1) / r) * algbw Using this formula, we can calculate the bus bandwidth, ``B``, for an AllGather collective operation among ``r`` ranks with size ``s`` that took ``t`` seconds.
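To make this concrete, here is a short, illustrative Python sketch that recomputes ``algbw`` and ``busbw`` for the 2-worker all-reduce example shown earlier; the all-reduce factor ``(2 * (r-1)) / r`` is taken from the table that follows.

.. code-block:: python

   # Recompute algbw and busbw for the earlier example:
   # 33554432 bytes transferred in 768 us across 2 ranks (all-reduce).
   GIB = 1024 ** 3  # bandwidths are reported in GiB/s

   def bandwidths(size_bytes: int, time_us: float, ranks: int):
       algbw = size_bytes / (time_us * 1e-6) / GIB
       busbw = (2 * (ranks - 1) / ranks) * algbw  # all-reduce factor
       return algbw, busbw

   print(bandwidths(33554432, 768, 2))  # ~(40.69, 40.69), matching the output above

Note that for 2 ranks the all-reduce factor equals 1, which is why ``algbw`` and ``busbw`` are identical in that example.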
We can now directly compare the calculated bus bandwidth to the actual hardware bandwidth to see how well the hardware is being utilized. For different operations that transfer a different number of chunks, the bandwidth calculation changes slightly, with our algbw factor ``(r-1) / r`` changing depending on the collective operation: .. list-table:: :widths: 40 40 :header-rows: 1 * - Collective Operation - Bus Bandwidth Factor * - All-Reduce - ``(2 * (r-1)) / r`` * - All-Gather - ``(r-1) / r`` * - Reduce-Scatter - ``(r-1) / r`` * - Send-Receive - 1 * - All-to-All - ``(r-1) / r`` * - Permute - 1 * - All-to-Allv - ``(r-1) / r`` CLI arguments ^^^^^^^^^^^^^ Required Arguments: ~~~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - - N/A, required argument - The type of Collective Communication operation to execute for this benchmark. Supported types: - ``all_reduce`` / ``allr``: All-Reduce - ``all_gather`` / ``allg``: All-Gather - ``reduce_scatter`` / ``redsct``: Reduce-Scatter - ``sendrecv``: Send-Receive - ``alltoall``: All-to-All - ``permute``: Permute - ``alltoallv``: All-to-Allv (Currently only supported for inter-node configurations) * - ``-r, --nworkers`` - N/A, required argument - Total number of workers (ranks) to use Benchmark Configuration: ~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``-N, --nnodes`` - 1 - Total number of nodes (instances) to use. The number of workers will be divided equally across all nodes. If this argument is greater than 1, `MPI Execution`_ or `Slurm Execution`_ will need to be used. * - ``-b, --minbytes`` - 32M - The starting size for the benchmark * - ``-e, --maxbytes`` - 32M - The end size for the benchmark. **nccom-test** will run benchmarks for all sizes between ``-b, --minbytes`` and ``-e, --maxbytes``, increasing the size by either ``-i, --stepbytes`` or ``-f, --stepfactor`` with every run. * - ``-i, --stepbytes`` - (``--maxbytes`` - ``--minbytes``) / 10 - Number of bytes by which to increase the benchmark's size on every subsequent run. For example, for this combination of arguments: ``-b 8 -e 16 -i 4``, the benchmark will be run for the following sizes: 8 bytes, 12 bytes, 16 bytes. * - ``-f, --stepfactor`` - N/A - Factor by which to increase the benchmark's size on every subsequent run. For example, for this combination of argument values: ``-b 8 -e 32 -f 2``, the benchmark will be run for the following sizes: 8 bytes, 16 bytes, 32 bytes. .. note:: All arguments that take a size in bytes will also accept larger size units, for example: ``-f 2048`` can be written as ``-f 2kb`` or ``-f 1048576`` can be written as ``-f 1MB``. Iteration Configuration: ~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``-n, --iters`` - 20 - Number of Collective Communication operations to execute during the benchmark. * - ``-w, --warmup_iters`` - 5 - Number of Collective Communication operations to execute as warmup during the benchmark. The warmup operations will execute prior to any of the measured operations and their performance will not be used to calculate the reported statistics. * - ``-I, --neff_iters`` - N/A - Number of times to execute the NEFF with Collective Communication operations during the benchmark.
* - ``-W, --neff_warmup_iters`` - N/A - Number of times to execute the NEFF with Collective Communication operations as warmup during the benchmark. All collective operations in a warmup NEFF execution will be ignored when calculating statistics. To execute collective operations, ``nccom-test`` will generate, and then execute, NEFFs (Neuron Executable File Format) containing several collective operation instructions. The above flags control how many collective operations are generated, run, and measured. There are two primary modes for controlling the number of collective operations run: 1. If neither the ``neff_iters`` nor the ``neff_warmup_iters`` flag is supplied, ``iters + warmup_iters`` will be treated as the desired total number of operations to be run. If necessary, ``nccom-test`` will spread this total number of operations out across several NEFFs. 2. If the user desires more control over how collective operation execution should be organized, they should use the ``neff_iters`` and ``neff_warmup_iters`` flags. When these flags are used, the ``iters`` and ``warmup_iters`` flags instead represent the number of operations in a single NEFF. The NEFF itself will be repeatedly run ``neff_iters + neff_warmup_iters`` times. Examples: - ``-n 15``, ``-w 5``, ``-I 10``, would result in 200 Collective Communication operations being run with 150 being measured: The generated NEFF will have 20 (15 measured, 5 warmup) ops and the NEFF will be run 10 times. - ``-n 15``, ``-w 5``, ``-I 10``, ``-W 5``, would result in 300 Collective Communication operations being run with 150 being measured: The generated NEFF will have 20 (15 measured, 5 warmup) ops and the NEFF will be run 15 (10 measured, 5 warmup) times.
* - ``--shared-output-buff`` - false - For the CC operation, use a single, shared, HBM output buffer between 2 NeuronCores in the same HBM domain. * - ``--alltoallv-metadata`` - N/A - For ``alltoallv`` collective operation, a ``json`` file containing send counts, send displacements, receive counts, and receive displacements for the collective operation. Counts specify the number of elements to send/receive between ranks; displacements specify where in the buffer to send/receive data. The length of the count and displacement arrays should equal the size of the replica group over which the ``alltoallv`` collective operation is performed. If one metadata entry is provided, it applies to all ranks; otherwise, specify one entry per rank. `AlltoAllV Example`_. .. _Data Integrity: Data Integrity: If the ``--check`` flag is provided when running ``nccom-test``, the correctness of the CC operations will be verified. There are currently two modes for verification: ``random`` (the default used when only ``--check`` is provided) and ``all_ones``. 1. The ``random`` mode will fill each input tensor with pseudo-random data and then, on the CPU, calculate an expected golden output. After collective operation execution, the output tensor of the operation will be compared against the calculated golden tensor. For non-integral types (e.g. ``fp16``, ``fp32``), golden comparison will use tolerances. For operations in which all participating ranks should finish with identical outputs (e.g. ``allr``, ``allg``), there will also be a check between ranks to ensure this. If the ``random`` check fails, input, output, and golden tensors will be saved to disk for further investigation. The ``--seed`` flag can be used to set the seed for the pseudo-random input tensor generation. Otherwise, the seed value will be based on the current time and logged. 2. The ``all_ones`` mode will fill each input tensor with the value ``1``. A single golden value, ``G``, will be calculated based on the operation. For example, the golden value ``G`` for an All-Reduce with 16 ranks will be ``16``. After operation execution, ``nccom-test`` will verify each output tensor is filled with ``G``. Prefer ``random`` mode for more rigorous verification; prefer ``all_ones`` for quicker, more easily interpreted verification. .. _MPI Execution: MPI Execution: ~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``-s, --hosts`` - N/A - Hosts on which to run execution. * - ``--hosts-file`` - N/A - File containing hosts on which to run execution. One host specified per line. * - ``--mpi-log-dir`` - N/A - If specified, logs from each node in an ``mpi`` multi-node benchmark will be saved to a unique file within the specified directory. To use ``mpi`` mode, provide all hosts for your invocation, either with the ``--hosts`` flag or a ``~/hosts`` file, and set the ``NEURON_RT_ROOT_COMM_ID`` environment variable to the IP address of the first host listed and any free port. Depending on your environment, ``mpi`` may require passwordless SSH access to each host in your invocation. See the `Open MPI SSH documentation `_ for details. Example: ``NEURON_RT_ROOT_COMM_ID=10.1.4.145:45654 nccom-test -r 64 -N 2 -d fp32 allr --hosts 10.1.4.145 10.1.4.138`` The above command will invoke a ``neuron-bench`` process on both listed hosts to execute the collective operations, using 32 ranks from each host.
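For scripted or repeated runs, the same multi-node invocation can be driven from Python. The sketch below simply wraps the example above with ``subprocess``; the host IPs and the port are the placeholders from that example, not values to copy verbatim.

.. code-block:: python

   # Minimal sketch: drive the multi-node MPI-mode example from a script.
   # The host IPs and port below are the example's placeholders.
   import os
   import subprocess

   env = dict(os.environ, NEURON_RT_ROOT_COMM_ID="10.1.4.145:45654")
   subprocess.run(
       ["nccom-test", "-r", "64", "-N", "2", "-d", "fp32", "allr",
        "--hosts", "10.1.4.145", "10.1.4.138"],
       env=env, check=True,
   )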
Latency data will be reported back from each host and collected on the host on which the ``nccom-test`` command was invoked. The host on which the ``nccom-test`` command is invoked should usually be one of the provided hosts, but it can be another unrelated host, as long as it can invoke MPI processes on the provided hosts. .. _Slurm Execution: Slurm Execution: ~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``-S, --slurm-mode`` - false - Use ``srun`` to run the benchmark on a ``slurm``-based cluster * - ``-u, --slurm-vcpus-per-node`` - Minimum CPU count amongst all nodes - Number of vCPUs available per node in the ``slurm`` allocation * - ``--slurm-setup-script`` - N/A - Script to run on each node in the ``slurm`` allocation before executing the benchmark. Can use ``default`` to run a default script installing the latest Neuron software. * - ``--slurm-job-id`` - alloc - Specify the jobId of the ``slurm`` allocation on which to execute the benchmark. By default, a new allocation will be created to execute the benchmark. * - ``--slurm-use-head-node-neuron-bench`` - false - Copy the ``neuron-bench`` binary from the head node to all nodes in the allocation To use ``slurm`` mode, specify the ``--slurm-mode`` flag. When using slurm mode, ``nccom-test`` invocations should be run from the head node of the slurm cluster. Users can either use an existing slurm job by providing a job id, or have ``nccom-test`` allocate one for them. Additionally, users can provide a path to a setup script to run on each slurm node before execution. Users can alternatively specify ``default`` to use a supplied default setup script. Examples: ``nccom-test -r 64 -N 2 allr --slurm-mode --slurm-setup-script path/to/my/custom-setup-script.sh`` The above command will execute collective operations across two nodes using slurm. Slurm will allocate a job with two nodes before beginning execution and will run the ``custom-setup-script.sh`` on each node before executing any collective operations. ``nccom-test -r 64 -N 2 allr --slurm-mode --slurm-job-id 12345`` The above command will use an existing slurm allocation (``jobId: 12345``) with no setup. Output: ~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``--non-interactive`` - false - Do not display any animation or progress indicator. * - ``--report-to-json-file`` - N/A - Persist config and results to the specified JSON file if a filepath is provided. * - ``-t, --stats`` - avg - Latency (time) statistics to display in the final output. Currently supports ``avg`` and any percentile (e.g. ``p15``, ``p50``, ``p90``). * - ``--show-algorithm`` - false - Show which algorithm (e.g. Ring, Mesh, RDH) was used to execute the collective operation in the ``nccom-test`` output. Currently, any hierarchical algorithms used will be displayed as ``hier``, and will not include any sub-algorithms. * - ``--show-input-output-size`` - false - Print or save to JSON the per-rank input and output sizes in bytes. * - ``--debug`` - false - Show debug logs from the execution of ``nccom-test`` and ``neuron-bench`` in real time. Enables ``non-interactive`` mode implicitly. SBUF Collectives: ~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``--sb2sb`` - false - Indicates whether to allocate input, output, and scratch-buffer on SBUF (rather than HBM). This may result in improved performance.
* - ``--input-shape`` - N/A - Provide input tensor dimensions in format: ``[step0,step1][num_elem0,num_elem1]``. ``step0/num_elem0`` correspond to the free dimension of the SBUF, while ``step1/num_elem1`` correspond to the partition dimension of the SBUF. * - ``--output-shape`` - N/A - Provide output tensor dimensions in format: ``[step0,step1][num_elem0,num_elem1]``. ``step0/num_elem0`` correspond to the free dimension of the SBUF, while ``step1/num_elem1`` correspond to the partition dimension of the SBUF. * - ``--cc-dim`` - 1 - Control the dimensions of tensor concatenation. Either concatenate the tensor in the free dimension (``cc-dim = 0``) or concatenate in the partition dimension first and wrap around in the free dimension second (``cc-dim = 1``) Replica Group: ~~~~~~~~~~~~~~ Flags to control which subset of ranks a collective operation will be executed on. .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``--data-parallel-dimension`` - N/A - Run the given collective operation in parallel across multiple sub-groups of size ``data-parallel-dimension``. For 128 ranks and a data parallel dimension of 2, there would be 64 parallel collective operations happening at the same time, each with 2 ranks. Primarily intended for multi-node executions with one-rank-per-node replica groups. * - ``--custom-replica-group`` - N/A - Provide the JSON file for custom-defined replica groups. * - ``--custom-src-target-pairs`` - N/A - Provide the JSON file for custom-defined source_target_pairs for the collective permute operation. Additional Flags: ~~~~~~~~~~~~~~~~~ .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Argument - Default value - Description * - ``--vcpu-pin-mode`` - false - Pin the CPU thread for each rank to a given CPU. * - ``--data-collector-port`` - 60006 - If running ``nccom-test`` in multi-node mode or on another node, a data collector is used to gather latencies from all nodes in the benchmark. Port to use for the data collector. * - ``--data-collector-host`` - current host - Hostname or IP address of the node to use as the data collector; all latencies from other nodes will be sent to this host Environment Variables ^^^^^^^^^^^^^^^^^^^^^ In addition to CLI arguments, there are also several environment variables which can be used to alter how collectives run inside ``nccom-test``. .. list-table:: :widths: 40 80 260 :header-rows: 1 * - Environment Variable - Default value - Description * - ``NEURON_LOGICAL_NC_CONFIG`` - 2 for ``trn2`` and ``trn3``; 1 for ``inf2`` and ``trn1`` - Controls how many physical NeuronCores are grouped to make up a logical NeuronCore. Users may also find certain Neuron Runtime environment variables useful with ``nccom-test`` executions. See :ref:`nrt-configuration`. Examples ^^^^^^^^ .. note:: Performance data shown in these examples should not be considered up-to-date. For the latest performance data, please refer to the performance section. Single Instance Examples ~~~~~~~~~~~~~~~~~~~~~~~~ - Quick environment validation .. code-block:: nccom-test -r 2 allr size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 33554432 33554432 uint8 768 40.69 40.69 Avg bus bandwidth: 40.6901GB/s If a problem is found, it can be reported in two possible ways: - Immediately: .. code-block:: nccom-test -r 2 allr Neuron DKMS Driver is not running! Read the troubleshooting guide at: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html#neuron-driver-installation-fails - After a benchmark attempt: ..
code-block:: nccom-test -r 2 allr size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 33554432 Failure running neuron-bench - log file /tmp/nccom_test_log_7pqpdfjf.log 1 errors found - test failed In this case, further information about the error can be found in the ``neuron-bench`` log file. - 2 rank all-reduce on a single instance for sizes ranging from 1KiB to 1GiB with a step of 4x .. code-block:: nccom-test -r 2 --minbytes 1kb --maxbytes 1gb --stepfactor 4 --datatype fp32 allr size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 1024 256 fp32 58 0.02 0.02 4096 1024 fp32 58 0.07 0.07 16384 4096 fp32 58 0.26 0.26 65536 16384 fp32 58 1.05 1.05 262144 65536 fp32 60 4.07 4.07 1048576 262144 fp32 68 14.36 14.36 4194304 1048576 fp32 107 36.51 36.51 16777216 4194304 fp32 332 47.06 47.06 67108864 16777216 fp32 1214 51.48 51.48 268435456 67108864 fp32 4750 52.63 52.63 1073741824 268435456 fp32 18930 52.83 52.83 Avg bus bandwidth: 23.6671GB/s - 32 rank all-gather on a single instance for sizes ranging from 1KiB to 1MiB with a step of 8x, with correctness checking .. code-block:: nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 1024 256 fp32 151 0.01 0.01 8192 2048 fp32 149 0.05 0.05 65536 16384 fp32 150 0.41 0.39 524288 131072 fp32 179 2.73 2.64 Avg bus bandwidth: 0.7731GB/s - Specify custom source-target pairs as a JSON file for the collective permute operator with ``--custom-src-target-pairs``. .. code-block:: nccom-test -r 8 --custom-src-target-pairs pairs.json permute size(B) count(elems) type time:avg(us) algbw(GB/s) busbw(GB/s) 33554432 33554432 uint8 894.24 37.52 37.52 Avg bus bandwidth: 37.5230GB/s cat pairs.json { "src_target_pairs": [ [ [0, 1], [1, 0], [2, 3], [3, 2], [4, 4], [5, 5], [6, 6], [7, 7] ] ] } - Reporting the input and output size explicitly with ``--show-input-output-size``. .. code-block:: nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg --show-input-output-size size(B) count(elems) total_input_size(B) total_output_size(B) type time:avg(us) algbw(GB/s) busbw(GB/s) 1024 256 32 1024 fp32 6.16 0.17 0.16 8192 2048 256 8192 fp32 6.48 1.26 1.23 65536 16384 2048 65536 fp32 8.17 8.02 7.77 524288 131072 16384 524288 fp32 23.16 22.64 21.93 Avg bus bandwidth: 7.7715GB/s - Getting percentile latency results with ``--stats`` .. code-block:: nccom-test -r 8 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --stats avg p25 p50 p90 p99 --iters 1000 allg size(B) count(elems) type time:avg(us) time:p25(us) time:p50(us) time:p90(us) time:p99(us) algbw(GB/s) busbw(GB/s) 1024 256 fp32 10.0 10 10 11 12 0.10 0.09 8192 2048 fp32 10.22 10 10 11 12 0.80 0.70 65536 16384 fp32 11.31 11 11 13 13 5.80 5.07 524288 131072 fp32 14.83 14 15 16 17 35.34 30.92 Avg bus bandwidth: 9.1966GB/s - Example results as JSON with ``--report-to-json-file`` ..
code-block:: nccom-test -r 32 --minbytes 1kb --maxbytes 1mb --stepfactor 8 --datatype fp32 --check allg --report-to-json-file nccom-results.json size(B) count(elems) type time:avg(us) algbw(GB/s) busbw(GB/s) 1024 256 fp32 6.19 0.17 0.16 8192 2048 fp32 6.55 1.25 1.21 65536 16384 fp32 8.18 8.01 7.76 524288 131072 fp32 23.11 22.69 21.98 Avg bus bandwidth: 7.7775GB/s python3 -m json.tool nccom-results.json { "results": [ { "size(B)": 1024, "count(elems)": 256, "type": "fp32", "algbw(GB/s)": 0.16553675170497603, "busbw(GB/s)": 0.16036372821419553, "time:avg(us)": 6.19 }, { "size(B)": 8192, "count(elems)": 2048, "type": "fp32", "algbw(GB/s)": 1.2500906056270864, "busbw(GB/s)": 1.21102527420124, "time:avg(us)": 6.55 }, { "size(B)": 65536, "count(elems)": 16384, "type": "fp32", "algbw(GB/s)": 8.008982241741455, "busbw(GB/s)": 7.758701546687035, "time:avg(us)": 8.18 }, { "size(B)": 524288, "count(elems)": 131072, "type": "fp32", "algbw(GB/s)": 22.688776793562784, "busbw(GB/s)": 21.97975251876395, "time:avg(us)": 23.11 } ] } - Example results with ``--show-algorithm`` flag .. code-block:: nccom-test -r 16 allr -b 4 -e 1gb -f 16 -d fp32 --show-algorithm size(B) count(elems) type time:avg(us) algbw(GB/s) busbw(GB/s) algorithm 4 1 fp32 299.91 0.00 0.00 mesh 32 8 fp32 299.69 0.00 0.00 mesh 512 128 fp32 299.82 0.00 0.00 mesh 8192 2048 fp32 299.74 0.03 0.05 mesh 131072 32768 fp32 574.15 0.23 0.43 mesh 2097152 524288 fp32 686.32 3.06 5.73 rdh 33554432 8388608 fp32 2754.15 12.18 22.84 kangaring 536870912 134217728 fp32 9689.51 55.41 103.89 kangaring Avg bus bandwidth: 16.6181GB/s Multiple Instances Example ~~~~~~~~~~~~~~~~~~~~~~~~~~ - 64 rank all-reduce on two instances for sizes ranging from 8 bytes to 1GiB with a step of 2x, running 50 ops .. code-block:: NEURON_RT_ROOT_COMM_ID=10.1.4.145:45654 nccom-test -r 64 -N 2 -b 8 -e 1GB -f 2 -n 50 -w 5 -d fp32 allr --hosts 127.0.0.1 10.1.4.138 size(B) count(elems) type time(us) algbw(GB/s) busbw(GB/s) 8 2 fp32 520 0.00 0.00 16 4 fp32 520 0.00 0.00 32 8 fp32 523 0.00 0.00 64 16 fp32 525 0.00 0.00 128 32 fp32 553 0.00 0.00 256 64 fp32 709 0.00 0.00 512 128 fp32 782 0.00 0.00 1024 256 fp32 840 0.00 0.00 2048 512 fp32 881 0.00 0.00 4096 1024 fp32 916 0.00 0.01 8192 2048 fp32 1013 0.01 0.01 16384 4096 fp32 1031 0.01 0.03 32768 8192 fp32 1174 0.03 0.05 65536 16384 fp32 1315 0.05 0.09 131072 32768 fp32 1315 0.09 0.18 262144 65536 fp32 1311 0.19 0.37 524288 131072 fp32 1312 0.37 0.73 1048576 262144 fp32 1328 0.74 1.45 2097152 524288 fp32 1329 1.47 2.89 4194304 1048576 fp32 1378 2.83 5.58 8388608 2097152 fp32 1419 5.51 10.84 16777216 4194304 fp32 2138 7.31 14.39 33554432 8388608 fp32 2711 11.53 22.69 67108864 16777216 fp32 3963 15.77 31.05 134217728 33554432 fp32 6279 19.91 39.19 268435456 67108864 fp32 11954 20.91 41.17 536870912 134217728 fp32 21803 22.93 45.15 1073741824 268435456 fp32 41806 23.92 47.09 Avg bus bandwidth: 9.3924GB/s .. _AlltoAllV Example: - Specify alltoallv-metadata as JSON for ``alltoallv`` operation ``--alltoallv-metadata``. .. 
code-block:: NEURON_RT_ROOT_COMM_ID=172.32.137.79:44444 nccom-test -r 2 -N 2 -d fp32 alltoallv -b 1MB -e 1MB --hosts 127.0.0.1 172.32.253.16 --alltoallv-metadata alltoallv_metadata.json size(B) count(elems) type time:avg(us) algbw(GB/s) busbw(GB/s) 1048608 262152 fp32 955.05 1.10 0.55 Avg bus bandwidth: 0.5490GB/s cat alltoallv_metadata.json { "alltoallv_metadata": [ { "send_counts": [512, 1024], "send_displs": [0, 512], "recv_counts": [256, 768], "recv_displs": [0, 256] } ] } ================================================ FILE: tools/neuron-sys-tools/neuron-ls.rst ================================================ .. _neuron-ls-ug: Neuron LS User Guide --------------------- The neuron-ls command is a tool for managing Neuron devices in your instance. This command serves two key purposes: it identifies all Neuron devices present in the current instance and provides information about the processes running on each device along with the command that launched that process. To use this command, simply type ``neuron-ls`` in your terminal. .. rubric:: neuron-ls CLI .. code-block:: text neuron-ls [options] **Options** ``--wide, -w`` Displays the table in a wider format. ``--show-all-procs, -a`` Show all processes using the Neuron Devices, including processes that aren't using Neuron Runtime 2.x such as ``neuron-monitor`` or ``neuron-ls`` itself. ``--topology, -t`` Display topology information about the system's Neuron Devices. ``--json-output, -j`` Output in JSON format. .. note:: ``neuron-ls`` fully supports the newly launched Trn2 instances. Examples ^^^^^^^^ ``neuron-ls`` is compatible with all Neuron instance types: inf1, inf2, trn1 and trn2. These are a few examples on running the tool on a trn2n.48xlarge: :: $ neuron-ls instance-type: trn2n.48xlarge instance-id: i-aabbccdd123456789 logical-neuroncore-config: 2 +--------+--------+----------+--------+---------------+--------------+---------------+------+ | NEURON | NEURON | NEURON | NEURON | CONNECTED | PCI | CPU | NUMA | | DEVICE | CORES | CORE IDS | MEMORY | DEVICES | BDF | AFFINITY | NODE | +--------+--------+----------+--------+---------------+--------------+---------------+------+ | 0 | 4 | 0-3 | 96 GB | 12, 3, 4, 1 | 0000:cc:00.0 | 48-95,144-191 | 1 | | 1 | 4 | 4-7 | 96 GB | 13, 0, 5, 2 | 0000:b5:00.0 | 48-95,144-191 | 1 | | 2 | 4 | 8-11 | 96 GB | 14, 1, 6, 3 | 0000:b6:00.0 | 48-95,144-191 | 1 | | 3 | 4 | 12-15 | 96 GB | 15, 2, 7, 0 | 0000:cb:00.0 | 48-95,144-191 | 1 | | 4 | 4 | 16-19 | 96 GB | 0, 7, 8, 5 | 0000:6f:00.0 | 0-47,96-143 | 0 | | 5 | 4 | 20-23 | 96 GB | 1, 4, 9, 6 | 0000:58:00.0 | 0-47,96-143 | 0 | | 6 | 4 | 24-27 | 96 GB | 2, 5, 10, 7 | 0000:59:00.0 | 0-47,96-143 | 0 | | 7 | 4 | 28-31 | 96 GB | 3, 6, 11, 4 | 0000:6e:00.0 | 0-47,96-143 | 0 | | 8 | 4 | 32-35 | 96 GB | 4, 11, 12, 9 | 0000:9b:00.0 | 0-47,96-143 | 0 | | 9 | 4 | 36-39 | 96 GB | 5, 8, 13, 10 | 0000:84:00.0 | 0-47,96-143 | 0 | | 10 | 4 | 40-43 | 96 GB | 6, 9, 14, 11 | 0000:85:00.0 | 0-47,96-143 | 0 | | 11 | 4 | 44-47 | 96 GB | 7, 10, 15, 8 | 0000:9a:00.0 | 0-47,96-143 | 0 | | 12 | 4 | 48-51 | 96 GB | 8, 15, 0, 13 | 0000:f8:00.0 | 48-95,144-191 | 1 | | 13 | 4 | 52-55 | 96 GB | 9, 12, 1, 14 | 0000:e1:00.0 | 48-95,144-191 | 1 | | 14 | 4 | 56-59 | 96 GB | 10, 13, 2, 15 | 0000:e2:00.0 | 48-95,144-191 | 1 | | 15 | 4 | 60-63 | 96 GB | 11, 14, 3, 12 | 0000:f7:00.0 | 48-95,144-191 | 1 | +--------+--------+----------+--------+---------------+--------------+---------------+------+ :: $ neuron-ls --wide instance-type: trn2n.48xlarge instance-id: i-aabbccdd123456789 
logical-neuroncore-config: 2 +--------+--------+--------+---------------+---------+--------+----------------------------------------------------------------------------------+---------+ | NEURON | NEURON | NEURON | CONNECTED | PCI | PID | COMMAND | RUNTIME | | DEVICE | CORES | MEMORY | DEVICES | BDF | | | VERSION | +--------+--------+--------+---------------+---------+--------+----------------------------------------------------------------------------------+---------+ | 0 | 4 | 96 GB | 12, 3, 4, 1 | cc:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 1 | 4 | 96 GB | 13, 0, 5, 2 | b5:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 2 | 4 | 96 GB | 14, 1, 6, 3 | b6:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 3 | 4 | 96 GB | 15, 2, 7, 0 | cb:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 4 | 4 | 96 GB | 0, 7, 8, 5 | 6f:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 5 | 4 | 96 GB | 1, 4, 9, 6 | 58:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 6 | 4 | 96 GB | 2, 5, 10, 7 | 59:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 7 | 4 | 96 GB | 3, 6, 11, 4 | 6e:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 8 | 4 | 96 GB | 4, 11, 12, 9 | 9b:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 9 | 4 | 96 GB | 5, 8, 13, 10 | 84:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 10 | 4 | 96 GB | 6, 9, 14, 11 | 85:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 11 | 4 | 96 GB | 7, 10, 15, 8 | 9a:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 12 | 4 | 96 GB | 8, 15, 0, 13 | f8:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 13 | 4 | 96 GB | 9, 12, 1, 14 | e1:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 14 | 4 | 96 GB | 10, 13, 2, 15 | e2:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | | 15 | 4 | 96 GB | 11, 14, 3, 12 | f7:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --warmup none --fixed-instance-count 64 --... | 2.0.0 | +--------+--------+--------+---------------+---------+--------+----------------------------------------------------------------------------------+---------+ :: $ neuron-ls --show-all-procs instance-type: trn2n.48xlarge instance-id: i-aabbccdd123456789 logical-neuroncore-config: 2 +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | NEURON | NEURON | NEURON | CONNECTED | PCI | PID | COMMAND | RUNTIME | | DEVICE | CORES | MEMORY | DEVICES | BDF | | | VERSION | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 0 | 4 | 96 GB | 12, 3, 4, 1 | cc:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... 
| 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 1 | 4 | 96 GB | 13, 0, 5, 2 | b5:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 2 | 4 | 96 GB | 14, 1, 6, 3 | b6:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 3 | 4 | 96 GB | 15, 2, 7, 0 | cb:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 4 | 4 | 96 GB | 0, 7, 8, 5 | 6f:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 5 | 4 | 96 GB | 1, 4, 9, 6 | 58:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 6 | 4 | 96 GB | 2, 5, 10, 7 | 59:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 7 | 4 | 96 GB | 3, 6, 11, 4 | 6e:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 8 | 4 | 96 GB | 4, 11, 12, 9 | 9b:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 9 | 4 | 96 GB | 5, 8, 13, 10 | 84:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 10 | 4 | 96 GB | 6, 9, 14, 11 | 85:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 11 | 4 | 96 GB | 7, 10, 15, 8 | 9a:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 12 | 4 | 96 GB | 8, 15, 0, 13 | f8:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... 
| 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 13 | 4 | 96 GB | 9, 12, 1, 14 | e1:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 14 | 4 | 96 GB | 10, 13, 2, 15 | e2:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ | 15 | 4 | 96 GB | 11, 14, 3, 12 | f7:00.0 | 268911 | neuron-bench exec --run-as-cc-neff --... | 2.0.0 | | | | | | | 269192 | neuron-ls --show-all-procs | NA | +--------+--------+--------+---------------+---------+--------+------------------------------------------+---------+ :: $ neuron-ls --topology instance-type: trn2n.48xlarge instance-id: i-aabbccdd123456789 logical-neuroncore-config: 2 +--------+--------+--------+---------------+---------+ | NEURON | NEURON | NEURON | CONNECTED | PCI | | DEVICE | CORES | MEMORY | DEVICES | BDF | +--------+--------+--------+---------------+---------+ | 0 | 4 | 96 GB | 12, 3, 4, 1 | cc:00.0 | | 1 | 4 | 96 GB | 13, 0, 5, 2 | b5:00.0 | | 2 | 4 | 96 GB | 14, 1, 6, 3 | b6:00.0 | | 3 | 4 | 96 GB | 15, 2, 7, 0 | cb:00.0 | | 4 | 4 | 96 GB | 0, 7, 8, 5 | 6f:00.0 | | 5 | 4 | 96 GB | 1, 4, 9, 6 | 58:00.0 | | 6 | 4 | 96 GB | 2, 5, 10, 7 | 59:00.0 | | 7 | 4 | 96 GB | 3, 6, 11, 4 | 6e:00.0 | | 8 | 4 | 96 GB | 4, 11, 12, 9 | 9b:00.0 | | 9 | 4 | 96 GB | 5, 8, 13, 10 | 84:00.0 | | 10 | 4 | 96 GB | 6, 9, 14, 11 | 85:00.0 | | 11 | 4 | 96 GB | 7, 10, 15, 8 | 9a:00.0 | | 12 | 4 | 96 GB | 8, 15, 0, 13 | f8:00.0 | | 13 | 4 | 96 GB | 9, 12, 1, 14 | e1:00.0 | | 14 | 4 | 96 GB | 10, 13, 2, 15 | e2:00.0 | | 15 | 4 | 96 GB | 11, 14, 3, 12 | f7:00.0 | +--------+--------+--------+---------------+---------+ Neuron Device Topology * * * * │ │ │ │ ▼ ▼ ▼ ▼ *––►[ 0 ]◄––►[ 1 ]◄––►[ 2 ]◄––►[ 3 ]◄––* ▲ ▲ ▲ ▲ │ │ │ │ ▼ ▼ ▼ ▼ *––►[ 4 ]◄––►[ 5 ]◄––►[ 6 ]◄––►[ 7 ]◄––* ▲ ▲ ▲ ▲ │ │ │ │ ▼ ▼ ▼ ▼ *––►[ 8 ]◄––►[ 9 ]◄––►[10 ]◄––►[11 ]◄––* ▲ ▲ ▲ ▲ │ │ │ │ ▼ ▼ ▼ ▼ *––►[12 ]◄––►[13 ]◄––►[14 ]◄––►[15 ]◄––* ▲ ▲ ▲ ▲ │ │ │ │ * * * * Legend: *––► = Wrap-around link :: $ neuron-ls -j [ { "neuron_device": 0, "bdf": "cc:00.0", "cpu_affinity": "48-95,144-191", "numa_node": "1", "connected_to": [ 12, 3, 4, 1 ], "nc_count": 4, "logical_neuroncore_config": 2, "memory_size": 103079215104, "neuroncore_ids": [ 0, 1, 2, 3 ], "neuron_processes": [ { "pid": 113985, "command": "neuron-bench exec --run-as-cc-neff --...", "neuron_runtime_version": "2.0.0" } ] }, ... { "neuron_device": 15, "bdf": "f7:00.0", "cpu_affinity": "48-95,144-191", "numa_node": "1", "connected_to": [ 11, 14, 3, 12 ], "nc_count": 4, "logical_neuroncore_config": 2, "memory_size": 103079215104, "neuroncore_ids": [ 60, 61, 62, 63 ], "neuron_processes": [ { "pid": 113985, "command": "neuron-bench exec --run-as-cc-neff --...", "neuron_runtime_version": "2.0.0" } ] } ] Field Definitions ^^^^^^^^^^^^^^^^^ - instance-type: Type of instance on which neuron-ls is running. - instance-id: EC2 ID of the instance on which neuron-ls is running. 
- logical-neuroncore-config: (only available on trn2 instances) the current logical NeuronCore configuration; for more information refer to :ref:`logical-neuroncore-config` - NEURON DEVICE / neuron_device: Logical ID assigned to the Neuron Device. - NEURON CORES / nc_count: Number of NeuronCores present in the Neuron Device. - NEURON CORE IDS / neuroncore_ids: Range or list of individual NeuronCore IDs belonging to the device, used with ``NEURON_RT_VISIBLE_CORES`` for selective core usage. - NEURON MEMORY / memory_size: Amount of DRAM memory in the Neuron Device. - CONNECTED DEVICES / connected_to: Logical IDs of the Neuron Devices connected to this Neuron Device. - PCI BDF / bdf: PCI Bus Device Function (BDF) ID of the device. - CPU AFFINITY / cpu_affinity: CPU cores to which the per-NeuronCore proxy threads are pinned. - NUMA NODE / numa_node: NUMA (Non-Uniform Memory Access) node associated with the Neuron Device. - PID / pid: ID of the process using this Neuron Device. - COMMAND / command: Command used to launch the process using this Neuron Device. - RUNTIME VERSION / neuron_runtime_version: Version of the Neuron Runtime (if applicable) for the application using this Neuron Device. ================================================ FILE: tools/neuron-sys-tools/neuron-monitor-user-guide.rst ================================================ .. _neuron-monitor-ug: Neuron Monitor User Guide ========================= .. contents:: Table of contents :local: :depth: 2 Overview -------- **neuron-monitor** collects metrics and stats from the Neuron applications running on the system and streams the collected data to ``stdout`` in ``JSON`` format. It is provided as part of the ``aws-neuron-tools`` package. These metrics and stats are organized into **metric groups** which can be configured by providing a configuration file as described in :ref:`using-neuron-monitor`. When running, **neuron-monitor** will: - Collect the data for the metric groups which, based on the elapsed time since their last update, need to be updated - Take the newly collected data and consolidate it into a large report - Serialize that report to JSON and stream it to stdout from where it can be consumed by other tools - such as the sample :ref:`neuron-monitor-cloudwatch.py ` and :ref:`neuron-monitor-prometheus.py ` scripts. - Wait until at least one **metric group** needs to be collected and repeat this flow .. note:: ``neuron-monitor`` fully supports the newly launched Trn2 instances. .. _using-neuron-monitor: Using neuron-monitor -------------------- .. _monitor_cli: .. rubric:: neuron-monitor CLI .. program:: neuron-monitor .. option:: neuron-monitor [parameters] neuron-monitor accepts the following optional parameters: - ``--verbose`` (int) default=0: Can be 0 to 4, and controls the amount of debugging and verbose information sent to stderr; **0: no output**, **4: maximum verbosity** - ``-c, --config-file`` (string): Allows specifying a valid path to a neuron-monitor JSON configuration file **Example:** .. code-block:: neuron-monitor -c monitor.conf Not specifying any configuration file will enable collecting all the metric groups with a period of 5 seconds for all currently running Neuron applications.
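Because neuron-monitor streams one JSON report per line to ``stdout``, it is straightforward to build your own consumer in the style of the companion scripts described later in this guide. Below is a minimal, illustrative sketch (it assumes newline-delimited JSON, which is how the companion scripts consume the stream):

.. code-block:: python

   # consume_monitor.py - minimal sketch of a custom neuron-monitor consumer.
   # Usage: neuron-monitor | python3 consume_monitor.py
   import json
   import sys

   for line in sys.stdin:
       report = json.loads(line)
       for runtime in report.get("neuron_runtime_data", []):
           tag = runtime.get("neuron_runtime_tag")
           error = runtime.get("error", "")
           print(f"app tag={tag} error={error!r}")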
Configuration file example ~~~~~~~~~~~~~~~~~~~~~~~~~~ Example of a configuration file which enables all available **metric groups** for every running Neuron application, with a global update period of 1 second, and sets an update period of 2 seconds for the ``"neuron_hw_counters"`` metric group: :: { "period": "1s", "neuron_runtimes": [ { "tag_filter": ".*", "metrics": [ { "type": "neuroncore_counters" }, { "type": "memory_used" }, { "type": "neuron_runtime_vcpu_usage" }, { "type": "execution_stats" } ] } ], "system_metrics": [ { "type": "vcpu_usage" }, { "type": "memory_info" }, { "period": "2s", "type": "neuron_hw_counters" } ] } Neuron applications tagging ~~~~~~~~~~~~~~~~~~~~~~~~~~~ In order to make application monitoring easier, Neuron applications can be tagged with a 255-character string which identifies that app. Tagging is done using the ``NEURON_PROCESS_TAG`` environment variable. For example: ``NEURON_PROCESS_TAG=my_app_1 python training.py`` will associate the ``my_app_1`` tag with that Python application. If ``NEURON_PROCESS_TAG`` is not specified, the application's PID will be used as its tag. This tag will be used by neuron-monitor to filter Neuron applications. JSON objects and fields in the configuration file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ``"neuron_runtimes"`` - array of objects specifying which Neuron applications to monitor and what metric groups are enabled for each of them - ``"tag_filter"`` - a regex which will be used to filter Neuron application tags in order to determine if they will be monitored (optional) - ``"metrics"`` - array of objects specifying which metric groups to capture for this Neuron application - ``"type"`` - type of metric group - ``"period"`` - this field applies to **metric group** objects and sets the amount of time between two updates for that metric group - it can be specified as part of the **root** and/or **neuron_runtime** objects, where it applies to all their children, and/or as part of a **metric group** object - if there's no period specified, a default value of **5 seconds** will be used - ``"system_metrics"`` - array of objects specifying which system-level metric groups are enabled Neuron Runtime-level metric groups ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - :ref:`neuron-monitor-nc-counters` - NeuronCore-related metrics - :ref:`neuron-monitor-memory-used` - data on the amount of memory used by the Neuron application - :ref:`neuron-monitor-vcpu-usage` - Neuron application vCPU utilization data - :ref:`neuron-monitor-execution-stats` - Neuron application execution stats, including error count and latency System-wide metric groups ~~~~~~~~~~~~~~~~~~~~~~~~~ - :ref:`neuron-monitor-vcpu-usage` - system-wide vCPU usage - :ref:`neuron-monitor-memory-info` - system-wide memory usage - :ref:`neuron-monitor-hw-counters` - counters for correctable and uncorrectable memory ECC events Execution model --------------- |image| neuron-monitor waits for one or more **metric groups** to become due for an update, then collects the corresponding data, consolidates it into a report which is streamed to stdout as JSON, and goes back to waiting. The JSON output format ---------------------- Whenever the report gets updated, a complete JSON is written to stdout. This is its structure: :: { "neuron_runtime_data": [ { "pid": 0, "address": "", "neuron_runtime_tag": "my_app_1", "error": "", "report": { "neuroncore_counters": { [...] }, "execution_stats": { [...] }, "memory_used": { [...] }, "neuron_runtime_vcpu_usage": { [...]
} } } ], "system_data": { "neuron_hw_counters": { [...] }, "vcpu_usage": { [...] }, "memory_info": { [...] } }, "instance_info": { [...] }, "neuron_hardware_info": { [...] }, "neuron_k8s_info": { [...] } } - ``"neuron_runtime_data"`` is an array containing one entry for each Neuron application which passes the filter specified in the settings file - ``"pid"`` is the PID of this Neuron application - ``"neuron_runtime_tag"`` is the configured tag for the Neuron application - ``"error"`` specifies any error that occurred when collecting data from this Neuron application - ``"report"`` will contain the results for the Neuron application-level metric groups; their formats are described below - ``"system_data"`` has a similar structure to ``"neuron_runtime_data"``'s ``"report"`` but only contains system-level metric groups (not associated with any Neuron application) Regardless of the configuration, the following two JSON objects are always present in the output: .. _neuron-monitor-instance-info: instance_info ~~~~~~~~~~~~~ Contains information about the instance on which neuron-monitor is running. :: "instance_info": { "instance_name": "My_Instance", "instance_id": "i-0011223344556677a", "instance_type": "trn2n.48xlarge", "instance_availability_zone": "us-west-2b", "instance_availability_zone_id": "usw2-az2", "instance_region": "us-west-2", "ami_id": "ami-0011223344556677b", "subnet_id": "subnet-112233ee", "error": "" } Depending on when the instance was launched, the following fields might not be available: - ``instance_availability_zone_id``: available only for instances launched on 2020-08-24 and later - ``instance_region``: available only for instances launched on 2020-08-24 and later - ``instance_name``: available only if ``instance_region`` is set and aws-cli tools are installed ``error`` will contain an error string if getting one of the fields, **except those mentioned above**, resulted in an error. .. _neuron-monitor-hardware-info: neuron_hardware_info ~~~~~~~~~~~~~~~~~~~~ Contains basic information about the Neuron hardware. :: "neuron_hardware_info": { "neuron_device_type": "trainium2", "neuron_device_version": "v4", "neuroncore_version": "v3d", "neuron_device_count": 16, "neuron_device_memory_size": 103079215104, "neuroncore_per_device_count": 4, "logical_neuroncore_config": 2, "error": "" } - ``neuron_device_type``: type of the Neuron Devices on the instance - ``neuron_device_version``: version of the Neuron Devices on the instance - ``neuroncore_version``: version of the NeuronCores on the instance - ``neuron_device_count``: number of available Neuron Devices - ``neuron_device_memory_size``: total memory available on each Neuron Device - ``neuroncore_per_device_count``: number of NeuronCores present on each Neuron Device - ``logical_neuroncore_config``: the current Logical NeuronCore configuration - ``error``: will contain an error string if any occurred when getting this information (usually due to the Neuron Driver not being installed or not running). The following JSON object is disabled by default, but can be made available if "k8s_info" is enabled: .. _neuron-monitor-k8s-info: neuron_k8s_info ~~~~~~~~~~~~~~~ Contains information about which Kubernetes pods/containers are using Neuron resources. :: "neuron_k8s_info": { "period": 15.030359284, "neuroncores_k8s_info": { "0": { "pod_name": "p0", "namespace": "n0", "container_name": ["c0"] }, "1": { "pod_name": "p0", "namespace": "n0", "container_name": ["c0"] }, ... "neurondevices_k8s_info": { "0": { "pod_name": "p0", "namespace": "n0", "container_name": ["c0"] }, ...
} "error": "" }, - ``"neuroncores_k8s_info"`` - object containing information on which Neuron cores are being used by Kubernetes pod/containers, indexed by Neuron core index: ``"neuroncore_index": { neuroncore_k8s_data }`` - ``"pod_name"`` - name of pod using Neuron core - ``"namespace"`` - namespace of pod using Neuron core - ``"container_name"`` - names of containers using Neuron core - ``"neurondevices_k8s_info"`` - object containing information on which Neuron devices are being used by Kubernetes pod/containers, indexed by Neuron device index: ``"neurondevice_index": { neurondevice_k8s_data }`` - ``"pod_name"`` - name of pod using Neuron device - ``"namespace"`` - namespace of pod using Neuron device - ``"container_name"`` - names of containers using Neuron device - ``"error"`` - will contain an error string if any occurred when getting this information For more information on how to enable K8s information, see :ref:`neuron-monitor-k8s-infopy`. .. _neuron-metric-groups: Metric Groups ~~~~~~~~~~~~~ Each **metric group** requested in the settings file will get an entry in the resulting output. The general format for such an entry is: :: "metric_group": { "period": 1.015, // Actual captured period, in seconds "error": "", // Error, if any occurred, otherwise an empty string [...] // Metric group specific data } .. _runtime-level-metric-groups-1: Neuron application level metric groups -------------------------------------- .. _neuron-monitor-nc-counters: neuroncore_counters ~~~~~~~~~~~~~~~~~~~ :: "neuroncore_counters": { "period": 1.000113182, "neuroncores_in_use": { "0": { "neuroncore_utilization": 42.01, "flops": 1234567891011, "v3d": { "nc_v3.0": { "neuroncore_utilization": 21.01 }, "nc_v3.1": { "neuroncore_utilization": 63.01 } } }, "1": { "neuroncore_utilization": 42.02, "flops": 1234567891021, "v3d": { "nc_v3.2": { "neuroncore_utilization": 21.02 }, "nc_v3.3": { "neuroncore_utilization": 63.02 } } }, [...] }, "error": "" } - ``"neuroncores_in_use"`` is an object containing data for all the NeuronCores that were active when the data was captured, indexed by NeuronCore index: ``"neuroncore_index": { neuroncore_data }`` - ``"neuroncore_utilization"`` - NeuronCore utilization, in percent, during the captured period - ``"flops"`` - number of floating point operations per second during the captured period - ``"v3d"`` - only available on Trn2 - contains the utilization for every physical NeuronCore that makes up the current NeuronCore - ``"error"`` - string containing any error that occurred when collecting the data .. 
_neuron-monitor-execution-stats: execution_stats ~~~~~~~~~~~~~~~ :: "execution_stats": { "period": 1.030613214, "error_summary": { "generic": 0, "numerical": 0, "transient": 0, "model": 0, "runtime": 0, "hardware": 0 }, "execution_summary": { "completed": 123, "completed_with_err": 0, "completed_with_num_err": 0, "timed_out": 0, "incorrect_input": 0, "failed_to_queue": 0 }, "latency_stats": { "total_latency": { "p0": 0.01100001, "p1": 0.01100002, "p25": 0.01100004, "p50": 0.01100008, "p75": 0.01100010, "p99": 0.01100012, "p100": 0.01100013 }, "device_latency": { "p0": 0.01000001, "p1": 0.01000002, "p25": 0.01000004, "p50": 0.01000008, "p75": 0.01000010, "p99": 0.01000012, "p100": 0.01000013 } }, "error": "" }, - ``"error_summary"`` is an object containing the error counts for the captured period indexed by their type - ``"generic"`` - generic execution errors - ``"numerical"`` - NaN errors encountered during execution - ``"transient"`` - recoverable errors, such as ECC corrections - ``"model"`` - model-related errors - ``"runtime"`` - Neuron Runtime errors - ``"hardware"`` - hardware errors such as uncorrectable ECC issues - ``"execution_summary"`` is an object containing all execution outcome counts for the captured period indexed by their type - ``"completed"`` - executions completed successfully - ``"completed_with_err"`` - executions that ended in an error other than a numerical error - ``"completed_with_num_err"`` - executions that ended in a numerical error - ``"timed_out"`` - executions that took longer than the Neuron Runtime's configured timeout value - ``"incorrect_input"`` - executions that failed to start due to incorrect input being provided - ``"failed_to_queue"`` - execution requests that were rejected due to the Neuron Runtime not being able to queue them - ``"latency_stats"`` contains two objects containing latency percentiles, in seconds, for the data captured for the model executed during the captured period. If there are no models being executed during this time, the two objects will be ``null`` (i.e. ``"total_latency": null``) - ``"total_latency"`` - percentiles, in seconds, representing latency for an execution as measured by the Neuron Runtime - ``"device_latency"`` - percentiles, in seconds, representing execution time exclusively on the Neuron Device - ``"error"`` - string containing any error that occurred when collecting the data .. _neuron-monitor-memory-used: memory_used ~~~~~~~~~~~ :: "memory_used": { "period": 1.00001, "neuron_runtime_used_bytes": { "host": 6997643264, "neuron_device": 12519788544, "usage_breakdown": { "host": { "application_memory": 6996594688, "constants": 0, "dma_buffers": 1048576, "tensors": 0 }, "neuroncore_memory_usage": { "0": { "constants": 193986816, "model_code": 176285056, "model_shared_scratchpad": 0, "runtime_memory": 0, "tensors": 20971520 }, "1": { "constants": 193986816, "model_code": 176285056, "model_shared_scratchpad": 0, "runtime_memory": 0, "tensors": 20971520 }, ... } }, "loaded_models": [ { "name": "neff", "uuid": "91f2f66e83ea419dace1da07617ad39f", "model_id": 10005, "is_running": false, "subgraphs": { "sg_00": { "memory_used_bytes": { "host": 20480, "neuron_device": 21001024, "usage_breakdown": { "host": { "application_memory": 20480, "constants": 0, "dma_buffers": 0, "tensors": 0 }, "neuron_device": { "constants": 20971520, "model_code": 29504, "runtime_memory": 0, "tensors": 0 } } }, "neuroncore_index": 0, "neuron_device_index": 12 } } }, ...
], "error": "" } - ``"memory_used"`` summarizes the amount of memory used by the Neuron application - ``"neuron_runtime_used_bytes"`` - current amount of memory used by the Neuron application - ``"host"`` - total host DRAM usage in bytes - ``"neuron_device"`` - total Neuron device memory usage in bytes - ``"usage_breakdown"`` - a breakdown of the total memory usage in the other two fields - ``"host"`` - breakdown of the host memory usage - ``"application_memory"`` - amount of host memory used by the application - this includes all allocations that are not included in the next categories - ``"constants"`` - amount of host memory used for constants during training (or weights during inference) - ``"dma_buffers"`` - amount of host memory used for DMA transfers - ``"tensors"`` - amount of host memory used for tensors - ``"neuroncore_memory_usage"`` - a breakdown of memory allocated on the Neuron Devices and the NeuronCores for which it was allocated - ``"0"`` - ``"64"`` (for trn2-48xlarge) - NeuronCores for which the memory was allocated - ``"constants"`` - amount of device memory used for constants during training (or weights during inference) - ``"model_code"`` - amount of device memory used for models' executable code - ``"model_shared_scratchpad"`` - amount of device memory used for the scratchpad shared by the models - a memory region reserved for the models' internal variables and auxiliary buffers - ``"runtime_memory"`` - amount of device memory used by the Neuron Runtime - ``"tensors"`` - amount of device memory used for tensors - ``"loaded_models"`` - array containing objects representing loaded models - ``"name"`` - name of the model - ``"uuid"`` - unique id for the model - ``"model_id"`` - Neuron application-assigned ID for this model - ``"is_running"`` - true if this model is currently started, false otherwise - "``subgraphs"`` - object containing all the subgraphs for the model, indexed by their name: ``"subgraph_name": { subgraph_data }`` - ``"memory_used_bytes"`` - memory usage for this subgraph - ``"host"`` - total host DRAM usage in bytes - ``"neuron_device"`` - total Neuron device DRAM usage in bytes - ``"usage_breakdown"`` - a breakdown of memory allocated at load time for this model - ``"host"`` - breakdown of host memory allocated for this model - ``"application_memory"`` - amount of host memory allocated for this model by the Neuron Runtime which doesn't fall in any of the next categories - ``"constants"`` - amount of host memory used for constants during training (or weights during inference) - ``"dma_buffers"`` - host memory allocated for DMA transfers for this model - ``"tensors"`` - amount of device memory used for tensors at model load time - ``"neuron_device"`` - a breakdown of device memory allocated for this model - ``"constants"`` - amount of device memory used for constants during training (or weights during inference) - ``"model_code"`` - amount of device memory used for the model's executable code - ``"runtime_memory"`` - amount of device memory used by the Neuron Runtime for this model - ``"tensors"`` - amount of device memory allocated for tensors at this model's load time - ``"neuroncore_index"`` - NeuronCore index on which the subgraph is loaded - ``"neuron_device_index"`` - Neuron device index on which the subgraph is loaded - ``"error"`` - string containing any error that occurred when collecting the data neuron_runtime_vcpu_usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~ :: "neuron_runtime_vcpu_usage": { "period": 1.030604818, "vcpu_usage": { "user": 42.01, "system": 
neuron_runtime_vcpu_usage
~~~~~~~~~~~~~~~~~~~~~~~~~

::

   "neuron_runtime_vcpu_usage": {
       "period": 1.030604818,
       "vcpu_usage": {
           "user": 42.01,
           "system": 12.34
       },
       "error": ""
   }

- ``"vcpu_usage"`` - object showing vCPU usage in percentages for the Neuron application during the captured period

  - ``"user"`` - percentage of time spent in user code by this Neuron application
  - ``"system"`` - percentage of time spent in kernel code by this Neuron application

- ``"error"`` - string containing any error that occurred when collecting the data

System level metric groups
--------------------------

.. _neuron-monitor-hw-counters:

neuron_hw_counters
~~~~~~~~~~~~~~~~~~

::

   "neuron_hw_counters": {
       "period": 1.030359284,
       "neuron_devices": [
           {
               "neuron_device_index": 0,
               "mem_ecc_corrected": 0,
               "mem_ecc_uncorrected": 0,
               "sram_ecc_uncorrected": 0,
               "sram_ecc_corrected": 0
           }
       ],
       "error": ""
   },

- ``"neuron_devices"`` - array containing ECC data for all Neuron devices

  - ``"neuron_device_index"`` - Neuron device index
  - ``"mem_ecc_corrected"`` - number of corrected ECC events in the Neuron device's DRAM
  - ``"mem_ecc_uncorrected"`` - number of uncorrected ECC events in the Neuron device's DRAM
  - ``"sram_ecc_uncorrected"`` - number of uncorrected ECC events in the Neuron device's SRAM
  - ``"sram_ecc_corrected"`` - number of corrected ECC events in the Neuron device's SRAM

- ``"error"`` - string containing any error that occurred when collecting the data

.. _neuron-monitor-vcpu-usage:

vcpu_usage
~~~~~~~~~~

::

   "vcpu_usage": {
       "period": 0.999974868,
       "average_usage": {
           "user": 32.77,
           "nice": 0,
           "system": 22.87,
           "idle": 39.36,
           "io_wait": 0,
           "irq": 0,
           "soft_irq": 0
       },
       "usage_data": {
           "0": {
               "user": 34.41,
               "nice": 0,
               "system": 27.96,
               "idle": 37.63,
               "io_wait": 0,
               "irq": 0,
               "soft_irq": 0
           },
           "1": {
               "user": 56.84,
               "nice": 0,
               "system": 28.42,
               "idle": 14.74,
               "io_wait": 0,
               "irq": 0,
               "soft_irq": 0
           },
           [...]
       },
       "context_switch_count": 123456,
       "error": ""
   }

- each vCPU usage object contains the following fields:

  - ``"user"`` - percentage of time spent in user code
  - ``"nice"`` - percentage of time spent executing niced user code
  - ``"system"`` - percentage of time spent executing kernel code
  - ``"idle"`` - percentage of time spent idle
  - ``"io_wait"`` - percentage of time spent waiting for IO operations
  - ``"irq"`` - percentage of time spent servicing hardware interrupts
  - ``"soft_irq"`` - percentage of time spent servicing software interrupts

- ``"average_usage"`` - contains the average usage across all vCPUs during the captured period
- ``"usage_data"`` - contains per-vCPU usage during the captured period
- ``"context_switch_count"`` - contains the number of vCPU context switches during the captured period
- ``"error"`` - string containing any error that occurred when collecting the data

.. _neuron-monitor-memory-info:

memory_info
~~~~~~~~~~~

::

   "memory_info": {
       "period": 5.346411129,
       "memory_total_bytes": 49345835008,
       "memory_used_bytes": 16042344448,
       "swap_total_bytes": 0,
       "swap_used_bytes": 0,
       "error": ""
   }

- ``"memory_total_bytes"`` - total size of the host memory, in bytes
- ``"memory_used_bytes"`` - amount of host memory in use, in bytes
- ``"swap_total_bytes"`` - total size of the host swap file, in bytes
- ``"swap_used_bytes"`` - amount of swap memory in use, in bytes
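The system-level metric groups can be inspected the same way. A minimal sketch, assuming ``jq`` is installed and that these groups are enabled and reported under the top-level ``system_data`` object described earlier in this guide:

.. code-block:: bash

   # Watch average vCPU usage and host memory consumption for each captured period
   neuron-monitor | jq --unbuffered '{cpu_avg: .system_data.vcpu_usage.average_usage,
                                      mem_used_bytes: .system_data.memory_info.memory_used_bytes}'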
.. _neuron-monitor-companion-scripts:

Companion scripts
-----------------

neuron-monitor is installed with three Python companion scripts: :ref:`neuron-monitor-cloudwatchpy`, :ref:`neuron-monitor-prometheuspy`, and :ref:`neuron-monitor-k8s-infopy`.

.. _neuron-monitor-cloudwatchpy:

neuron-monitor-cloudwatch.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It requires Python3 and the `boto3 Python module `__. It is installed to: ``/opt/aws/neuron/bin/neuron-monitor-cloudwatch.py``.

.. _using-neuron-monitor-cloudwatchpy:

Using neuron-monitor-cloudwatch.py
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

   neuron-monitor | neuron-monitor-cloudwatch.py --namespace <namespace> --region <region>

For example:

::

   neuron-monitor | neuron-monitor-cloudwatch.py --namespace neuron_monitor_test --region us-west-2

.. _neuron-monitor-prometheuspy:

neuron-monitor-prometheus.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It requires Python3 and the `Prometheus client Python module `__. It is installed to: ``/opt/aws/neuron/bin/neuron-monitor-prometheus.py``.

.. _using-neuron-monitor-prometheuspy:

Using neuron-monitor-prometheus.py
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

   neuron-monitor | neuron-monitor-prometheus.py --port <port>

For example:

::

   neuron-monitor | neuron-monitor-prometheus.py --port 8008

The default value for ``--port`` is ``8000``. If your data visualization framework is Grafana, we provide a :download:`Grafana dashboard ` which integrates with Prometheus and this script.

.. |image| image:: ../../images/nm-img2.png

.. _neuron-monitor-k8s-infopy:

neuron-monitor-k8s-info.py (Beta)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It requires Python3 and the `gRPC Python package `__. It is installed to: ``/opt/aws/neuron/bin/neuron-monitor-k8s-info.py``.

.. important::

   This companion script is in Beta and is disabled by default. It only works on EKS, and is currently not supported with EKS auto mode.

.. _using-neuron-monitor-k8s-infopy:

Using neuron-monitor-k8s-info.py
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

   neuron-monitor | neuron-monitor-prometheus.py --port <port> --enable-k8s-info | neuron-monitor-k8s-info.py --period <period>

For example:

::

   neuron-monitor | neuron-monitor-prometheus.py --port 8008 --enable-k8s-info | neuron-monitor-k8s-info.py --period 30

The default value for ``--period`` is ``15``.

Running neuron-monitor in a Kubernetes environment
--------------------------------------------------

To run neuron-monitor in a Kubernetes environment, please refer to the instructions `here `_.

================================================
FILE: tools/neuron-sys-tools/neuron-sysfs-user-guide.rst
================================================

.. _neuron-sysfs-ug:

Neuron Sysfs User Guide
=======================

.. contents:: Table of contents
   :local:
   :depth: 3

Introduction
------------

The kernel provides a few ways in which userspace programs can get system information from the kernel space. Sysfs is one common way to do so. It is a virtual filesystem, typically mounted on the ``/sys`` directory, that contains information about hardware devices attached to the system and about the drivers handling those devices. By navigating the hierarchical structure of the sysfs filesystem and viewing the information provided by its files and directories, you can gather valuable information that can help diagnose and resolve a wide range of hardware and system issues.

Thus, a sysfs filesystem is set up per Neuron Device under ``/sys/devices/virtual/neuron_device`` to give you insight into the Neuron Driver and Runtime at the system level. By running a few simple commands that read or write sysfs files, you can get information such as Runtime status, memory usage, and Driver info. You can even create your own shell scripts to query Runtime and Driver statistics from sysfs and generate customized reports. This user guide will first explain the Neuron sysfs structure and then introduce several ways you can perform diagnostics with Neuron sysfs.
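As a first orientation before the full layout is described in the next section, the following commands list the per-device directories and read a couple of the informational files (the paths match the structure shown below):

.. code-block:: bash

   # List all Neuron Device entries
   ls /sys/devices/virtual/neuron_device/

   # Read the NeuronCore count and the connected devices for Neuron Device 0
   cat /sys/devices/virtual/neuron_device/neuron0/core_count
   cat /sys/devices/virtual/neuron_device/neuron0/connected_devices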
Neuron Sysfs Filesystem Structure
---------------------------------

High Level Overview
^^^^^^^^^^^^^^^^^^^

Here is the high level structure of the Neuron sysfs filesystem, where the total and present counters are not shown:

.. code-block:: bash

   /sys/devices/virtual/neuron_device/
   ├── neuron0/
   │   ├── subsystem
   │   ├── uevent
   │   ├── connected_devices
   │   ├── core_count
   │   ├── reset
   │   ├── power/
   │   │   ├── async
   │   │   ├── control
   │   │   ├── runtime_active_time
   │   │   ├── runtime_active_kids
   │   │   └── ...
   │   ├── info/
   │   │   ├── notify_delay
   │   │   ├── serial_number
   │   │   └── architecture/
   │   │       ├── arch_type
   │   │       ├── device_name
   │   │       └── instance_type
   │   ├── stats/
   │   │   ├── hardware/
   │   │   │   ├── mem_ecc_uncorrected
   │   │   │   ├── mem_ecc_repairable_uncorrected
   │   │   │   └── sram_ecc_uncorrected
   │   │   ├── memory_usage/
   │   │   │   └── host_mem/
   │   │   │       ├── application_memory
   │   │   │       ├── constants
   │   │   │       ├── dma_buffers
   │   │   │       ├── dma_rings
   │   │   │       ├── driver_memory
   │   │   │       ├── notifications
   │   │   │       ├── tensors
   │   │   │       └── uncategorized
   │   │   └── power/
   │   │       └── utilization
   │   ├── neuron_core0/
   │   │   ├── info/
   │   │   │   └── architecture/
   │   │   │       └── arch_type
   │   │   ├── stats/
   │   │   │   ├── status/
   │   │   │   │   ├── exec_bad_input
   │   │   │   │   ├── hw_error
   │   │   │   │   ├── infer_failed_to_queue
   │   │   │   │   ├── resource_nc_error
   │   │   │   │   ├── unsupported_neff_version
   │   │   │   │   ├── failure
   │   │   │   │   ├── infer_completed_with_error
   │   │   │   │   ├── invalid_error
   │   │   │   │   ├── oob_error
   │   │   │   │   ├── success
   │   │   │   │   ├── generic_error
   │   │   │   │   ├── infer_completed_with_num_error
   │   │   │   │   ├── resource_error
   │   │   │   │   └── timeout
   │   │   │   ├── memory_usage/
   │   │   │   │   ├── device_mem/
   │   │   │   │   │   ├── collectives
   │   │   │   │   │   ├── constants
   │   │   │   │   │   ├── dma_rings
   │   │   │   │   │   ├── driver_memory
   │   │   │   │   │   ├── model_code
   │   │   │   │   │   ├── model_shared_scratchpad
   │   │   │   │   │   ├── nonshared_scratchpad
   │   │   │   │   │   ├── notifications
   │   │   │   │   │   ├── runtime_memory
   │   │   │   │   │   ├── tensors
   │   │   │   │   │   └── uncategorized
   │   │   │   │   └── host_mem/
   │   │   │   └── other_info/
   │   │   │       ├── flop_count
   │   │   │       ├── inference_count
   │   │   │       ├── model_load_count
   │   │   │       ├── reset_fail_count
   │   │   │       ├── reset_req_count
   │   │   │       └── nc_time_in_use
   │   │   └── ...
   │   ├── neuron_core1/
   │   │   ├── info/
   │   │   │   └── ...
   │   │   └── stats/
   │   │       └── ...
   │   └── ...
   ├── neuron1
   ├── neuron2
   ├── neuron3
   └── ...

Each Neuron Device is represented as a directory under ``/sys/devices/virtual/neuron_device/``, where ``neuron0/`` represents Neuron Device 0, ``neuron1/`` represents Neuron Device 1, and so on. Each NeuronCore is represented as a directory under a Neuron Device directory, named ``neuron_core{0,1,2,...}``. Runtime and Driver info and statistics are collected per NeuronCore in two directories under the NeuronCore directory: ``info/`` and ``stats/``.

Most of the metrics belong to a category called "counter". Each counter is represented as a directory which holds two numerical values as two files: total and present. Each memory usage counter has an additional value called peak. The total value starts accumulating when the Driver is loaded, the present value records the last change of the metric, and the peak value records the maximum value observed so far. Each counter has the same filesystem structure:

.. code-block:: bash

   /sys/devices/virtual/neuron_device/neuron0/neuron_core0/status/
   ├── exec_bad_input/
   │   ├── total
   │   └── present
   ├── hw_error/
   │   ├── total
   │   └── present
   ├── infer_failed_to_queue/
   │   ├── total
   │   └── present
   └── ...

Description for Each Field
^^^^^^^^^^^^^^^^^^^^^^^^^^

``info/``: This directory stores general information about hardware and software. None of them are counter types.

* ``notify_delay``: The delay between notifications from the Neuron Device.
  Current settings are on (``0``) or off (``-1``). Off by default.

* ``serial_number``: The unique device identifier.
* ``architecture/``: This directory stores hardware architecture information.

  * ``arch_type``: The architecture type of the Neuron Device. Sample architecture types are v1, v2, and v3. The value is read-only.
  * ``instance_type``: The instance type of the Neuron Device. Sample instance types are Inf1, Inf2, and Trn1. The value is read-only.
  * ``device_type``: The Neuron Device type. Sample Neuron Device types are Inferentia, Inferentia2, and Trainium1. The value is read-only.

``stats/``: This directory stores Neuron Runtime and Driver statistics. It contains three subdirectories: ``status/``, ``memory_usage/``, and ``other_info/``.

* ``status/``: This directory stores the number of occurrences of each return status of API calls. As explained in :ref:`The LIBNRT API Return Codes `, every API call returns an NRT_STATUS value, which represents the return status of that API call. Our sysfs filesystem stores all ``NRT_STATUS`` values as subdirectories under the ``status/`` directory. They all have the counter structure, so each ``NRT_STATUS`` subdirectory holds two values (total and present) and records the number of times you receive a certain ``NRT_STATUS``. The descriptions of the ``NRT_STATUS`` subdirectories align with :ref:`The LIBNRT API Return Codes `.
* ``memory_usage/``: This directory contains memory usage statistics for both device and host, represented as counters. In this directory, the total counters indicate the current memory usage, the present counters represent the amount of memory allocated or deallocated by the previous operation, and the peak counters indicate the maximum memory usage observed. Additionally, this directory provides detailed breakdown statistics for device and host memory usage. These memory breakdown details correspond to the :ref:`Memory Usage Summary ` section displayed in Neuron Monitor.

  * ``device_mem/``: The amount of memory that the Neuron Runtime uses for weights, instructions, and DMA rings. This device memory per NeuronCore is further categorized into the following types: ``collectives/``, ``constants/``, ``dma_rings/``, ``driver_memory/``, ``model_code/``, ``model_shared_scratchpad/``, ``nonshared_scratchpad/``, ``notifications/``, ``runtime_memory/``, ``tensors/``, and ``uncategorized/``. Each of these categories has total, present, and peak.
    * ``collectives`` - amount of device memory used for collective communication between workers
    * ``constants`` - amount of device memory used for constants (for applications running training) or weights (for applications running inference)
    * ``dma_rings`` - amount of device memory used for storing model executable code used for data movement
    * ``driver_memory`` - amount of device memory used by the Neuron Driver
    * ``model_code`` - amount of device memory used for storing model executable code
    * ``model_shared_scratchpad`` - amount of device memory used for the shared model scratchpad, a buffer shared between models on the same NeuronCore used for internal model variables and other auxiliary buffers
    * ``nonshared_scratchpad`` - amount of device memory used for the non-shared model scratchpad, a buffer used by a single model for internal model variables and other auxiliary buffers
    * ``notifications`` - amount of device memory used to store instruction-level trace information used to profile workloads run on the device
    * ``runtime_memory`` - amount of device memory used by the Neuron Runtime (outside of the previous categories)
    * ``tensors`` - amount of device memory used for tensors
    * ``uncategorized`` - amount of device memory that does not belong in any other category in this list

  * ``host_mem/``: The amount of memory that the Neuron Runtime uses for input and output tensors. The host memory per Neuron Device is further categorized into the following types: ``application_memory/``, ``constants/``, ``dma_buffers/``, ``dma_rings/``, ``driver_memory/``, ``notifications/``, ``tensors/``, and ``uncategorized/``. These categories provide a more granular host memory classification compared to the :ref:`Host Used Memory ` section. Each of these categories has total, present, and peak.

* ``hardware/``: Hardware statistics.

  * ``mem_ecc_uncorrected``: The number of unrepairable uncorrected ECC events in the Neuron device's DRAM.
  * ``mem_ecc_repairable_uncorrected``: The number of repairable uncorrected ECC events in the Neuron device's DRAM.
  * ``sram_ecc_uncorrected``: The number of uncorrected ECC events in the Neuron device's SRAM.

* ``power/``: Power statistics.

  * ``utilization``: Reports per-minute power usage statistics as a percentage of max power in the following format: ``<status>,<timestamp>,<min_power>,<max_power>,<avg_power>``

    **Field descriptions:**

    status
      Indicates the sampling state in a string. Valid values are:

      ``POWER_STATUS_VALID`` - Sampling successful

      ``POWER_STATUS_NO_DATA`` - No samples available

      ``POWER_STATUS_INVALID`` - An internal sampling error occurred

    timestamp
      Time when the sample was collected, in Unix epoch seconds (integer)

    min_power
      Minimum power utilization during the sampling period (0.00-100.00%)

    max_power
      Maximum power utilization during the sampling period (0.00-100.00%)

    avg_power
      Average power utilization during the sampling period (0.00-100.00%)

    The interface updates these statistics every minute based on continuous power sampling.
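For example, reading the ``utilization`` file returns a line in the format above; the values shown here are illustrative only:

.. code-block:: bash

   $ cat /sys/devices/virtual/neuron_device/neuron0/stats/power/utilization
   POWER_STATUS_VALID,1718000000,12.50,85.00,47.25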
* ``other_info/``: This directory contains statistics that are not covered by ``status/`` and ``memory_usage/``. None of them are counter types.

  * ``flop_count``: The number of flops. You can calculate TFLOP/s by dividing ``flop_count`` by the time interval.
  * ``inference_count``: The number of successful inferences.
  * ``model_load_count``: The number of successful model loads.
  * ``reset_fail_count``: The number of failed device resets.
  * ``reset_req_count``: The number of device reset requests.
  * ``nc_time_in_use``: The time interval in microseconds between the start and the end of the current execution on hardware.

Other fields:

* ``connected_devices``: The list of connected devices' ids. You should see the same values as neuron-ls's CONNECTED DEVICES column.
* ``reset``: Writing to this file resets the corresponding Neuron Device.

Read and Write to Sysfs
^^^^^^^^^^^^^^^^^^^^^^^

Reading a sysfs file gives the value for the corresponding metric. You can use the ``cat`` command to view the contents of the sysfs files:

.. code-block:: bash

   ubuntu@ip-xxx-xx-xx-xxx:~$ sudo cat /sys/devices/virtual/neuron_device/neuron0/neuron_core0/stats/status/failure/total
   0
   ubuntu@ip-xxx-xx-xx-xxx:~$ sudo cat /sys/devices/virtual/neuron_device/neuron0/neuron_core0/info/architecture/arch_type
   NCv2

Sysfs metrics of counter type are write-to-clear. You can write any value to the file, and the metric will be set to 0:

.. code-block:: bash

   ubuntu@ip-xxx-xx-xx-xxx:~$ echo 1 | sudo tee /sys/devices/virtual/neuron_device/neuron0/neuron_core0/stats/status/failure/total
   1

Writing to ``reset`` resets the corresponding Neuron Device. For example, the following resets Neuron Device 0:

.. code-block:: bash

   ubuntu@ip-xxx-xx-xx-xxx:~$ echo 1 | sudo tee /sys/devices/virtual/neuron_device/neuron0/reset
   1

Note
^^^^

All files under ``/sys/devices/virtual/neuron_device/neuron0/power`` such as ``runtime_active_kids`` or ``runtime_status`` are related to generic device power management. They are not created or controlled by our sysfs metrics. The word ``runtime`` in these files does not refer to the Neuron Runtime.

.. _troubleshoot_via_sysfs:

How to Troubleshoot via Sysfs
-----------------------------

You can troubleshoot your ML jobs with one or a few simple commands that read or write the sysfs filesystem. You can also aggregate metrics across all the NeuronCores and all the Neuron Devices to get a summarized view using your own scripts, as shown in the sketch below.
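For instance, here is a minimal sketch of such a script; it sums the total ``success`` status counters across every NeuronCore of every Neuron Device, using the counter layout described above:

.. code-block:: bash

   #!/bin/bash
   # Sum the 'success' status counters across all NeuronCores of all Neuron Devices
   total=0
   for counter in /sys/devices/virtual/neuron_device/neuron*/neuron_core*/stats/status/success/total; do
       value=$(sudo cat "$counter")
       total=$((total + value))
   done
   echo "Total successful executions across all NeuronCores: $total"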
You can also use the Sysfs notification feature to wait passively (without wasting CPU cycles) for changes to the values of Sysfs files. To use this feature, you need to implement a user-space program that calls the ``poll()`` function on the Sysfs file that you want to wait on. The ``poll()`` function has the following signature: ``unsigned int (*poll) (struct file *, struct poll_table_struct *)``. By default, the Sysfs notification feature is turned off when the driver is loaded. To enable notifications, set the value of ``/sys/devices/virtual/neuron_device/neuron0/info/notify_delay`` to 0. To disable notifications, set it to -1. Please note that enabling this feature can impact performance. Here is a sample user-space program using ``poll()``:

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>
   #include <fcntl.h>
   #include <unistd.h>
   #include <poll.h>

   int main(int argc, char *argv[])
   {
       char readbuf[128];
       int attr_fd = -1;
       struct pollfd pfd;
       int retval = 0;
       ssize_t read_bytes;

       if (argc < 2) {
           fprintf(stderr, "Error: Please specify sysfs file path\n");
           exit(1);
       }

       attr_fd = open(argv[1], O_RDONLY, 0);
       if (attr_fd < 0) {
           perror(argv[1]);
           exit(2);
       }

       /* Read and print the initial value of the attribute */
       read_bytes = read(attr_fd, readbuf, sizeof(readbuf));
       if (read_bytes < 0) {
           perror(argv[1]);
           exit(3);
       }
       printf("%.*s", (int)read_bytes, readbuf);

       pfd.fd = attr_fd;
       pfd.events = POLLERR | POLLPRI;
       pfd.revents = 0;

       /* Wait for the driver to signal a value change, then re-read and print */
       while ((retval = poll(&pfd, 1, 100)) >= 0) {
           if (pfd.revents & (POLLERR | POLLPRI)) {
               pfd.revents = 0;
               lseek(attr_fd, 0, SEEK_SET);
               read_bytes = read(attr_fd, readbuf, sizeof(readbuf));
               if (read_bytes < 0) {
                   perror(argv[1]);
                   exit(4);
               }
               printf("%.*s", (int)read_bytes, readbuf);
           }
       }
       return 0;
   }
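To try the program out, a minimal sketch follows (assuming the source above is saved as ``sysfs_poll.c``, a hypothetical file name): compile it, enable notifications on the device, and point it at the counter you want to watch:

.. code-block:: bash

   gcc -o sysfs_poll sysfs_poll.c

   # Enable sysfs notifications on Neuron Device 0 (write -1 to disable again)
   echo 0 | sudo tee /sys/devices/virtual/neuron_device/neuron0/info/notify_delay

   # Block until the counter changes, printing each new value
   sudo ./sysfs_poll /sys/devices/virtual/neuron_device/neuron0/neuron_core0/stats/status/success/total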
================================================
FILE: tools/neuron-sys-tools/neuron-top-user-guide.rst
================================================

.. _neuron-top-ug:

Neuron Top User Guide
=====================

.. contents:: Table of contents
   :local:
   :depth: 2

Overview
--------

``neuron-top`` provides useful information about NeuronCore and vCPU utilization, memory usage, loaded models, and Neuron applications.

.. note::

   ``neuron-top`` fully supports the newly launched trn2 instances.

.. note::

   If you are parsing ``neuron-top`` output in your automation environment, you can now replace it with ``neuron-monitor`` (:ref:`neuron-monitor-ug`), which outputs data in a standardized, easier-to-parse JSON format.

Using neuron-top
----------------

Command line arguments
~~~~~~~~~~~~~~~~~~~~~~

Launch ``neuron-top`` by simply typing its name in the shell: ``neuron-top``.

User interface
~~~~~~~~~~~~~~

The title section of the user interface shows the application's version number, the EC2 instance ID, and the instance type on which it is running:

|titleimg|

The rest of the user interface is divided into 4 sections. The data shown in these sections applies to the currently selected tab - which can be the 'all' tab, which aggregates data from all running Neuron processes, or a tab representing a single Neuron process:

|overview|

* The ``NeuronCore Utilization`` section shows the NeuronCore utilization for the currently selected tab. The section title includes the version of the NeuronCores on the instance (for example, ``v2`` for trn1 and inf2 instances, ``v3`` for trn2 instances with ``LNC=1``, and ``v3d`` for trn2 instances with ``LNC=2``). Pressing the 'F' key toggles between displaying utilization percentages - as seen in the previous image - and teraflops (trillion floating point operations per second), as seen in the image below:

  |flops|

* The ``VCPU Utilization`` section shows:

  * ``System vCPU usage`` - the two percentages are user% and system%
  * ``Runtime vCPU usage`` - same breakdown

.. _neuron_top_mem_usage:

* The ``Memory Usage Summary`` section provides a breakdown of the total memory usage on the Neuron Device as well as on the host:

  .. _neuron_top_host_mem_usage:

  * ``Host Used Memory`` - amount of host memory used by the selected application (or an aggregate of all applications if 'All' is selected)

    * ``Total`` - total amount of host memory used
    * ``Tensors`` - amount of host memory used for tensors
    * ``Constants`` - amount of host memory used for constants (for applications running training) or weights (for applications running inferences)
    * ``DMA Buffers`` - amount of host memory used for DMA transfers
    * ``App. Memory`` - amount of host memory used by the application that doesn't fall in any of the previous categories

  .. _neuron_top_device_mem_usage:

  * ``Device Used Memory`` - amount of device memory used by the selected application (or an aggregate of all applications if 'All' is selected)

    * ``Total`` - total amount of device memory used
    * ``Tensors`` - amount of device memory used for tensors
    * ``Constants`` - amount of device memory used for constants (for applications running training) or weights (for applications running inferences)
    * ``Model Code`` - amount of device memory used for storing model executable code
    * ``Runtime Memory`` - amount of device memory used by the Neuron Runtime (outside of the previous categories)
    * ``Model Scratchpad`` - amount of device memory used for the shared model scratchpad, a shared buffer used for internal model variables and other auxiliary buffers

* ``Memory Usage Details`` contains memory usage data organized as a tree which can be expanded/collapsed. The columns are:

  * ``Model ID`` - the Neuron Runtime identifier for this model instance
  * ``Host Memory`` - amount of host memory used
  * ``Device Memory`` - amount of device memory used

  The tree view shows the amount of memory used for the same categories shown in the ``Memory Usage Summary``, but in this section they are attached either to a model (if the memory was allocated at model load time for that model) or to a NeuronCore (if the memory can't be associated with a model but was allocated for that NeuronCore). The 'parent' shows the total amount of memory used - the sum of its children.

.. note::

   The up/down/left/right keys can be used to navigate the tree view. The 'x' key expands/collapses the entire tree.

The bottom bar shows which Neuron process's data is currently displayed by highlighting its tag using a green font and marking it with a pair of '>', '<' characters. The 'all' tab shows an aggregated view of all the Neuron processes currently running on the instance.

|tabbar|

.. note::

   The '1'-'9' keys select the current tab. 'a'/'d' selects the previous/next tab on the bar.

.. |titleimg| image:: ../../images/trn2-neuron-top-header.png
.. |overview| image:: ../../images/trn2-neuron-top.png
.. |flops| image:: ../../images/trn2-neuron-top-nc.png
.. |tabbar| image:: ../../images/nt-2.png

================================================
FILE: tools/profiler/neuron-profile-user-guide.rst
================================================

.. _neuron-profile-ug:

Neuron Profiler User Guide
==========================

The Neuron Profiler, ``neuron-profile``, is a tool to profile and analyze the performance of an ML model compiled with the Neuron compiler and run on NeuronDevices.

.. important::

   The Neuron Profiler will be replaced by the new Neuron Explorer in a future release. For more details and migration guidance, see :ref:`neuron-explorer-faq`.

``neuron-profile`` helps developers identify performance bottlenecks and optimize their workloads for NeuronDevices. neuron-profile provides insights into NeuronDevice activity, including the instructions executed on each compute engine (e.g. Tensor engine, Vector engine, etc.), DMA data movement activity, and performance metrics such as engine utilization, DMA throughput, memory usage, and more.

NeuronDevice activity is collected by the ``neuron-profile capture`` command, which runs the model with tracing enabled. Profiling typically has near zero overhead because NeuronDevices have dedicated on-chip hardware profiling.
Additionally, ``neuron-profile`` supports Neuron Kernel Interface (NKI) developers in profiling their kernels. For more information, please refer to :ref:`use-neuron-profile`.

.. _neuron-profiler-installation:

Installation
------------

``neuron-profile`` comes as part of the ``aws-neuronx-tools`` package, and will be installed to ``/opt/aws/neuron/bin``.

.. note::

   ``neuron-profile`` requires Ubuntu 22.04 or newer, or Amazon Linux 2023 or newer. Capturing profiles requires an Inferentia or Trainium instance, but processing profiles can be done on any instance type.

The Neuron web profile viewer utilizes InfluxDB OSS 2.x to store time series data for the profiled workloads after post-processing. Please follow the instructions provided at https://portal.influxdata.com/downloads/ for the correct OS. A sample installation of Neuron Profile and InfluxDB is provided below.

Ubuntu
~~~~~~

.. code-block:: bash

   # Install Neuron Profile
   . /etc/os-release
   sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
   deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
   EOF
   wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
   sudo apt-get update -y
   sudo apt-get install aws-neuronx-tools -y

   # Install InfluxDB
   wget -q https://repos.influxdata.com/influxdata-archive_compat.key
   cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
   echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list
   sudo apt-get update && sudo apt-get install influxdb2 influxdb2-cli -y
   sudo systemctl start influxdb
   influx setup   # Fill in the information to finish the setup

Capturing a profile
-------------------

The ``neuron-profile`` tool can both capture and post-process profiling information. ``neuron-profile`` takes a compiled model (a NEFF), executes it, and saves the profile results to a NTFF (``profile.ntff`` by default). For this example, we assume a NEFF is already available as ``file.neff``.

::

   $ neuron-profile capture -n file.neff -s profile.ntff

Capturing profiles for multi-worker jobs
----------------------------------------

``neuron-profile`` can capture profiles for collectives-enabled NEFFs running across multiple NeuronCores, NeuronDevices, or even nodes. This is useful for understanding performance and communication overheads when deploying larger distributed models.

The following example performs a distributed run across all NeuronDevices and NeuronCores on an inf2.24xlarge instance, capturing profiles for all 12 workers (one for each NeuronCore).

::

   $ neuron-profile capture -n file.neff --collectives-workers-per-node 12 -s output/profile.ntff

A profile is saved for each worker in the output directory.

::

   $ ls output
   profile_rank_0.ntff   profile_rank_2.ntff  profile_rank_6.ntff
   profile_rank_1.ntff   profile_rank_3.ntff  profile_rank_7.ntff
   profile_rank_10.ntff  profile_rank_4.ntff  profile_rank_8.ntff
   profile_rank_11.ntff  profile_rank_5.ntff  profile_rank_9.ntff

It is also possible to run a distributed job while only capturing a profile for a specific worker instead of all workers. To do that, use the ``--collectives-profile-id`` option.

::

   $ neuron-profile capture -n file.neff --collectives-profile-id 5 --collectives-workers-per-node 12 -s output/profile.ntff
   $ ls output
   profile_rank_5.ntff

Providing per-worker inputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, ``neuron-profile capture`` uses all-zero inputs or a single set of inputs specified via positional arguments. For multi-worker jobs where each worker needs different inputs, use the ``--multi-input`` (``-m``) option to specify a file that maps inputs to each worker. Each line in the multi-input file corresponds to one worker and follows the same format as the positional ``inputs`` argument (``<input name> <input file>`` pairs separated by spaces).
For example, for a 2-worker job:

::

   # inputs.txt
   IN1 worker0_x.npy IN2 worker0_y.npy
   IN1 worker1_x.npy IN2 worker1_y.npy

Then capture the profile with:

::

   $ neuron-profile capture -n file.neff -m inputs.txt --collectives-workers-per-node 2 -s output/profile.ntff

.. note::

   The ``--multi-input`` option cannot be used together with the positional ``inputs`` argument.

Capturing profiles for multi-node jobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For multi-node jobs, ``neuron-profile`` must be invoked on each node, using ``--collectives-worker-start-id`` to specify the global index of the first worker on the given node. For example, for a two-node job with a total of four workers and two workers per node, the following commands are run on each node.

::

   # on node 0
   $ neuron-profile capture -n file.neff --collectives-worker-start-id 0 --collectives-workers-per-node 2 --collectives-worker-count 4

   # on node 1
   $ neuron-profile capture -n file.neff --collectives-worker-start-id 2 --collectives-workers-per-node 2 --collectives-worker-count 4

``neuron-profile`` saves the profile for a worker on the node where that worker was launched. So in the case above, ``profile_rank_0.ntff`` and ``profile_rank_1.ntff`` are saved to node 0, and ``profile_rank_2.ntff`` and ``profile_rank_3.ntff`` are saved to node 1.

Processing and viewing the profile results
------------------------------------------

To analyze and view the collected profiling data, use the ``view`` subcommand of ``neuron-profile``. This command performs two main functions: it post-processes the profiling data and starts an HTTP server. Once the server is running, you can access the profiling results through your web browser. Please note: Chrome is the officially supported browser for viewing profiling results.

.. note::

   Profiles can be processed and viewed on another machine without Neuron devices. The ``aws-neuronx-tools`` package needs to be installed so that you can run ``neuron-profile view``. To process the profile on another instance, you need to copy the NEFF and NTFF files from your Inf or Trn instance to that instance.

Viewing a single profile
~~~~~~~~~~~~~~~~~~~~~~~~

The first way to invoke ``neuron-profile view`` is to pass both the NEFF and the NTFF to this command. It will post-process these artifacts and print out a direct link to the profile view.

::

   $ neuron-profile view -n file.neff -s profile.ntff
   View profile at http://localhost:3001/profile/n_fdc71a0b582ee3009711a96e59958af921243921
   ctrl-c to exit

Viewing profiles for multi-worker jobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profiles from multi-worker jobs (i.e. more than one NeuronCore) can either be viewed individually or in a combined collectives view. Since profile data is often similar between workers and processing profile data for all workers can be time-consuming, it is recommended to first explore the profile for a single worker or a small subset of workers.

Viewing the profile for a specific worker is the same as for single-worker profiles.

::

   $ neuron-profile view -n file.neff -s output/profile_rank_5.ntff
   View profile at http://localhost:3001/profile/n_fdc71a0b582ee3009711a96e59958af921243921

To view the profile for multiple workers, pass the directory containing all worker profiles to ``neuron-profile``.
::

   $ neuron-profile view -n file.neff -d output
   View profile at http://localhost:3001/profile_cc/p_9a69d907e1350100c9b03745eaa67aa7422842ed

|neuron-profile-multiworker-timeline|

When viewing profiles with the combined collectives view, you can easily switch between the timelines of different workers by clicking the per-worker "Rank" tabs. Note: the "CC Aggregated View" currently shows no data. This will be populated in an upcoming release.

Viewing multiple profiles
~~~~~~~~~~~~~~~~~~~~~~~~~

Alternatively, when post-processing multiple profiles, it may be desirable to have a persistent server running while processing results in the background. In this case, we can skip passing arguments to the command, which will direct users to the main page listing all available profiles.

::

   $ neuron-profile view
   View a list of profiles at http://localhost:3001/

In a separate window, we can kick off the post-processing without launching another server by passing the ``--ingest-only`` flag.

::

   $ neuron-profile view -n file.neff -s profile.ntff --ingest-only
   Profile "n_47cf9972d42798d236caa68952d0d29a76d8bd66" is ready to view

``n_47cf9972d42798d236caa68952d0d29a76d8bd66`` is the bucket where the data is stored. We can find this profile at ``localhost:3001/profile/<bucket>``.

Accessing the profiles
~~~~~~~~~~~~~~~~~~~~~~

If ``neuron-profile view`` is run on a remote instance, you may need to use port forwarding to access the profiles. From the local machine, SSH to the remote instance and forward ports 3001 (the default ``neuron-profile`` HTTP server port) and 8086 (the default InfluxDB port). Then, in the browser, go to ``localhost:3001`` to view the profiles.

::

   $ ssh <user>@<instance> -L 3001:localhost:3001 -L 8086:localhost:8086

.. _neuron-profile-ug-alternative-outputs:

Alternative output formats
~~~~~~~~~~~~~~~~~~~~~~~~~~

Besides the web view mentioned above, ``neuron-profile`` also supports other output formats, such as ``summary-text`` and ``summary-json`` for viewing overall metrics of the profile, as well as ``json`` for a parsable alternative.

Profile summary
^^^^^^^^^^^^^^^

You can see a summary of each profile using the command ``neuron-profile view --output-format summary-text -n file.neff -s output/profile_rank_<worker id>.ntff``. This output includes summary metrics and fields for the NeuronCore (``nc_idx``) and NeuronDevice (``nd_idx``) on which the worker was run.

For example, the following shows worker 5 used core 1 on device 2 and took 0.017 seconds (17 ms) to run the model.

::

   $ neuron-profile view --output-format summary-text -n file.neff -s output/profile_rank_5.ntff | grep -e "nd_idx" -e "nc_idx" -e "total_time"
   nc_idx 1
   nd_idx 2
   total_time 0.017

This summary is also available as JSON using ``--output-format summary-json``.

JSON
^^^^

You can also view the profile summary and all post-processed profiler events together as a single JSON. To do that, use the ``--output-format json`` option.

::

   $ neuron-profile view --output-format json --output-file profile.json -n file.neff -s output/profile_rank_5.ntff
   $ cat profile.json
   {
     "summary": [
       {
         "total_time": 0.017,
         "event_count": 11215
         [...]
       }
     ],
     "instruction": [
       {
         "timestamp": 10261883214,
         "duration": 148,
         "label": "TensorMatrix",
         "hlo_name": "%add.1 = add(%dot, %custom-call.44)",
         "opcode": "MATMUL",
         "operands": "S[5] (Tensor)++@complete acc_flags=3 row_grp=q0 src=fp16@0x5600[1,0,0][3,1,1] dst=0x2000000[1,0,0][3,1,1] 3*128 "
       },
       [...]
     ]
   }
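Because the JSON output can be large, it lends itself to scripted analysis. As a minimal sketch, assuming ``jq`` is installed and using the ``instruction`` fields shown above, the following lists the five longest-running instructions:

.. code-block:: bash

   # Show the five instructions with the longest duration
   jq '.instruction | sort_by(-.duration) | .[:5] | map({label, opcode, duration})' profile.json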
Understanding a Neuron profile
------------------------------

This section provides a quick overview of the features and information available through the Neuron web profile viewer. For more information on the terms used, please check out the :ref:`neuron_hw_glossary`.

Timeline
~~~~~~~~

|neuron-profile-web-timeline|

The execution timeline is plotted based on the elapsed nanoseconds since the start of execution.

Starting from the bottom, the ``TensorMatrix Utilization`` row shows the efficiency of the TensorEngine, and the ``Pending DMA Count`` and ``DMA Throughput`` rows show the DMA activity. In general, we want these to be as high as possible; in some cases they may give clues as to whether the workload is memory or compute bound.

Next are the individual NeuronCore engine executions. These rows show the start and end times for instructions executed by each engine, and clicking on one of these bars will show more detailed information, as well as any dependencies that were found. For models involving collective compute operations, you will additionally see rows labeled with ``CC-core``, which are used to synchronize the CC operations.

Towards the top is the DMA activity. This can include the transfers of input and output tensors, intermediate tensors, and any additional spilling or loading to and from the on-chip SRAM memory.

.. _neuron-profile-ug-features:

Features
~~~~~~~~

The following are some useful features that may help with navigating a profile:

- Dragging your cursor across a portion of the timeline will zoom in to the selected window, providing a more in-depth view of the execution during that time period.
- Hovering over a point will reveal a subset of the information associated with it.
- Clicking a point will open a text box below the timeline with all the information associated with it.
- Right-clicking a point will drop a marker at that location. This marker will persist when zooming in and out.

  - All marker information can be found by clicking the ``Annotations`` button.
  - Markers can be saved and loaded by using a provided name for the marker set.
  - Individual markers can be renamed or deleted in this menu as well.
  - The time span between markers is automatically shown, and users can change the marker name next to ``diff vs`` to calculate the time between other markers.

  |neuron-profile-annotation-menu|

- The "Search" tab can be used to find and highlight specific points in the profile related to the queried field(s).
- Click on the "Box Select" button in the top-right corner of the timeline and then click and drag on any region of the plot to select all events in that region and get summary statistics such as total duration and breakdowns of opcodes, transfer sizes, and more.

View Settings
^^^^^^^^^^^^^

Options within the ``View Settings`` tab can be used to further customize the timeline view. Editing any settings will update the URL accordingly, which can be used to re-visit the current view at a later time. To speed up initial load times, the default will be a ``Minimal View`` which only shows the instructions executed and the model FLOPs utilization (MFU) over time. Changing between the minimal and full views can also be done through the ``Reset to Full View`` or ``Reset to Minimal View`` buttons.

- ``DMA color group`` will recolor DMAs based on the selected grouping. For example, "Engine" will re-color the DMAs based on the associated engine.
- ``Instruction color group`` will recolor instructions based on the selected grouping. For example, "Layer" will re-color the timeline based on the associated framework layer name.
For example, "Layer" will re-color the timeline based on the associated framework layer name. - ``Layer group depth`` will group and color instructions at the selected layer depth. It will apply when ``Instruction color group`` is set to "Layer". **Example:** When ``Layer group depth`` is 2, instructions with layers `model/layer1/op1` and `model/layer1/op2` will be set to the same color. - ``Semaphore IDs`` allows for the selection of multiple semaphore values to show at once within the timeline |neuron-profile-view-settings| Additionally, there are various summary tabs that can be clicked to provide more information on the model/NEFFs. - ``Layer Summary`` shows timing information, FLOPs and instructions counts per layer. - ``Selection Summary`` shows summarized information for all data points in the selected window when using the "Box Select" mode. - ``NEFF Header`` shows details on the profiled NEFF, such as the number of NeuronCores required to execute. - ``NEFF Nodes`` shows input, output, and weight tensor information, including name, size, and shape. - ``Model Info`` shows a summary of the NTFF, such as the NeuronCore the model was executed on, number of notifications, and hardware execution time. - ``DMA Queues Info`` shows more information on the queues used for data movement. - ``NC Memory Usage Info`` shows a snapshot of the device memory usage breakdown before profiling was started. - ``Terminology`` shows a description of metrics provided in the summary table. |neuron-profile-web-summaries| Performance Warnings ~~~~~~~~~~~~~~~~~~~~ Furthermore, ``neuron-profile`` will automatically highlight some potential performance issues with warning annotations. For example if a tensor has been loaded more than 2 times a warning annotation (seen below as an orange box) will be drawn, encircling the dma instructions where the tensor was loaded many times. Hover on the annotation to see more details about loading the tensor. Another kind of warning annotation will highlight areas of high throttling. This provides the user a potential reason for slow down (thermal protection). Specific throttling details are shown when hovering the annotation. |neuron-profile-tensor-reload-annotation| .. _neuron-profile-collectives-barrier: Collectives ~~~~~~~~~~~ For models involving collective operations, the timeline will show a box around all data points related to each operation. Hovering the top left of the box will reveal more information associated with the operation. .. note:: this feature requires profiles to be captured with Neuron Runtime 2.20 or higher. |neuron-profile-cc-op-annotation| Additionally, for any on-device collectives synchronization barrier, a similar box will be display indicating a barrier instead of an actual collectives operation. |neuron-profile-cc-op-barrier| Event Details ~~~~~~~~~~~~~ The information when a point is clicked is grouped by categories such as `Timing` or `IDs` for convenience. Each row will also include a tool tip on the right side, which can be hovered for an explanation on what the field represents. For instruction `Operands` specifically, clicking on the tooltip will reveal a breakdown of fields that compose an operand, as well as a generic example for reference. The examples may not apply directly to the currently viewed profile. |neuron-profile-click-tooltip| .. 
.. _neuron-profile-framework-stack-trace:

Framework Stack Trace
---------------------

The Framework Stack Trace feature shows up in the Event Details when an instruction in the device profile is clicked. It can be used to map device instructions back to framework-level code in JAX or PyTorch, to better understand what part of the application code resulted in a particular device instruction.

|neuron-profile-stack-trace-event-details|

To enable tracking of the stack trace information, you need to set environment variables before compiling your NEFF:

::

   export XLA_IR_DEBUG=1
   export XLA_HLO_DEBUG=1

Once you have the NEFF, you can simply capture the profile as usual. While viewing the profile, use the ``--framework-source-root`` option to pass the path to the framework source files. This is optional and is only needed if you want to view your code alongside the profile.

::

   $ neuron-profile view -n file.neff -s profile.ntff --framework-source-root /path/to/framework/source/files

|neuron-profile-stack-trace-viewer|

Searching Profiles
~~~~~~~~~~~~~~~~~~

Searching helps identify specific data points that may be worth investigating, such as all instructions related to a specific layer or operation. In the "Search" tab, select the corresponding field of interest and enter the value to search for. Multiple fields can be searched together. Please refer to the tooltip within the tab for more help on the query syntax. The search results will also include a summary of all data points found within the current time range.

|neuron-profile-search-summary|

Hardware Errors
~~~~~~~~~~~~~~~

Invalid code can lead to errors on Neuron hardware. These errors will be displayed in Neuron Profile's Custom Notification timeline, as shown below. For example, an Out of Bounds (OOB) error is displayed as:

|neuron-profile-oob-error|

Users can correlate the error to the time it occurred and view nearby events to help debug.

.. _neuron-profile-scratchpad-mem-usage:

View Scratchpad Usage With Memory Tracker
-----------------------------------------

The Memory Tracker feature in Neuron Profiler provides detailed insights into scratchpad memory usage over time, showing how memory is allocated and utilized by different tensors during model execution. This is particularly useful for understanding memory bottlenecks and optimizing memory usage patterns.

To enable Memory Tracker, you need to set environment variables before compiling your NEFF:

::

   export XLA_IR_DEBUG=1
   export XLA_HLO_DEBUG=1

Then compile your model with these debug flags enabled. After compilation, capture the profile with the ``--enable-dge-notifs`` flag or set ``NEURON_RT_ENABLE_DGE_NOTIFICATIONS=1``:

::

   $ neuron-profile capture -n file.neff --enable-dge-notifs

Finally, view the profile with Memory Tracker enabled:

::

   $ neuron-profile view -n file.neff -s profile.ntff --enable-memory-tracker

The Memory Tracker displays a timeline showing scratchpad memory usage over time, with a detailed breakdown of which tensors are consuming memory at any given point. This visualization helps identify:

- Peak scratchpad memory usage
- Memory allocation patterns
- Tensor-specific memory consumption
- Potential memory optimization opportunities

|neuron-profiler-memory-tracker|

You can interact with the Memory Tracker timeline similar to other profile views - clicking on memory usage bars will show detailed information about the tensors using memory at that time, and you can zoom in to specific time ranges to get a more detailed view of memory allocation patterns.
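Putting the steps above together, here is an end-to-end sketch of the Memory Tracker workflow (the NEFF name is illustrative, and the compilation step happens in your framework of choice):

.. code-block:: bash

   # 1. Enable debug metadata before compiling the NEFF
   export XLA_IR_DEBUG=1
   export XLA_HLO_DEBUG=1
   # ... compile your model here to produce file.neff ...

   # 2. Capture the profile with DGE notifications enabled
   neuron-profile capture -n file.neff --enable-dge-notifs -s profile.ntff

   # 3. View the profile with Memory Tracker enabled
   neuron-profile view -n file.neff -s profile.ntff --enable-memory-tracker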
Viewing Profiles with Perfetto
------------------------------

Perfetto is an open-source trace analysis toolkit with a powerful UI for visualizing and analyzing trace data. Users of Neuron Profiler have the option of viewing their profiles in the Perfetto UI.

To process your profile and generate a Perfetto trace file that can be viewed in the Perfetto UI, run the following command:

::

   $ neuron-profile view -n file.neff -s profile.ntff --output-format perfetto

This will generate an ``ntff.pftrace`` file. Go to https://ui.perfetto.dev/ in your browser and open the ``ntff.pftrace`` file to view your profile in Perfetto.

.. note::

   When loading trace files in the Perfetto UI, your data is processed locally and not uploaded to Perfetto's servers.

|neuron-profile-perfetto-device|

.. _neuron-profile-large-perfetto-profiles:

Viewing Large Profiles In Perfetto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Your browser may run out of memory when viewing ``ntff.pftrace`` (Perfetto trace) files that are more than a few hundred MB. To get around this problem, you can use the trace processor script by running the following commands on the local system where you wish to view the profile:

::

   curl -LO https://get.perfetto.dev/trace_processor
   chmod +x ./trace_processor
   ./trace_processor --httpd ntff.pftrace

Now go to https://ui.perfetto.dev/ in your browser, and in the dialog box that pops up, click the "YES, use loaded trace" button.

For more information on using the trace processor script and viewing large traces, please refer to the Perfetto documentation at https://perfetto.dev/docs/visualization/large-traces.

Showing Dependencies In Perfetto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Neuron Profiler does not process dependencies for profiles to be viewed in Perfetto, because Perfetto renders the full dependency chain, which can be visually overwhelming. To include dependencies that can be viewed when clicking instructions and DMAs in the Perfetto UI, use the ``--show-perfetto-flows`` flag when processing your profile.

::

   $ neuron-profile view -n file.neff -s profile.ntff --output-format perfetto --show-perfetto-flows

CLI reference
-------------

.. rubric:: neuron-profile capture

.. code-block:: text

   neuron-profile capture [parameters] [inputs...]

Takes a given compiled NEFF, executes it, and collects the profile results. When no inputs are provided, all-zero inputs are used, which may result in inf or NaNs. It is recommended to use ``--ignore-exec-errors``.

**Parameters**

``-n, --neff`` (string) The compiled NEFF to profile.

``-s, --session-file`` (string) The file to store profile session information in.

``--ignore-exec-errors`` Ignore errors during execution.

``inputs`` (positional args) List of inputs in the form of ``<input name> <input file>`` pairs separated by spaces. For example: ``IN1 x.npy IN2 y.npy``.

The following ``neuron-profile capture`` arguments are only relevant for multi-worker jobs:

``-m, --multi-input`` (string) Path to a file that describes the input list for each requested worker. Each line in the file should correspond to one worker and follow the same format as the ``inputs`` positional argument (i.e. ``<input name> <input file>`` pairs separated by spaces). Cannot be used together with the ``inputs`` positional argument. If ``inputs`` is used instead, all workers will use the same inputs.

``--collectives-profile-id`` (string) Worker id which will be profiled. Passing ``all`` profiles all workers. (default: ``all``)

``-r, --collectives-workers-per-node`` (int) The number of workers on the current node.
The global worker id (rank) of worker n on the current node is ``collectives-worker-start-id+n``.

``--collectives-worker-count`` (int) Total number of Neuron workers across all nodes for this collectives run.

``--collectives-worker-start-id`` (int) The rank offset for the first worker on the current node. For example, if node 0 has workers 0,1 and node 1 has workers 2,3, then ``collectives-worker-start-id`` for node 0 and 1 will be 0 and 2, respectively. (default: ``0``)

.. rubric:: neuron-profile view

.. code-block:: text

   neuron-profile view [parameters]

**Parameters**

``-n, --neff-path`` (string) The compiled NEFF file location.

``-s, --session-file`` (string) The profile results NTFF file location.

``-d, --session-dir`` (string) Directory containing profile files for multi-worker runs.

``--output-format`` (string) How the processed profile should be presented. The default ``db`` writes processed data to the database. ``summary-text`` and ``summary-json`` print the summary data as a table or json, respectively, without writing to the database. The ``perfetto`` option writes processed data to Perfetto's native protobuf-based tracing format, which can be visualized in the Perfetto UI. The ``json`` option writes processed data to human-readable JSON. (default: ``db``)

``--output-file`` (string) File path to write results to, if applicable for the given output format.

``--db-endpoint`` (string) The endpoint of InfluxDB. (default: ``http://localhost:8086``)

``--db-org`` (string) The org name of InfluxDB.

``--db-bucket`` (string) Name of the InfluxDB bucket where ingested profile data is stored. Also used in the URL for viewing the profile. (Optional)

``--port`` (int) The port number of the HTTP server. (default: ``3001``)

``--force`` Force overwrite an existing profile in the database.

``--terminology`` Print a helpful table of terminology used by the profiler.

``--enable-memory-tracker`` Enable Memory Tracker to view scratchpad usage over time with a breakdown of usage per tensor. This requires having set ``XLA_IR_DEBUG=1`` and ``XLA_HLO_DEBUG=1`` before NEFF compilation and passing ``--enable-dge-notifs`` when capturing the profile.

FAQ
---

Difference between TensorE and TensorMatrixE Rows in Timeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- TensorE includes the instruction trace for LoadStationary (LoadWeight)
- TensorMatrixE includes the instruction trace for MultiplyMoving (Matmul)
- Both instruction traces happen on the same TensorE engine, but we separate them into two rows to de-clutter the timeline, due to the background load stationary feature (loading the stationary matrix for the next matmul in parallel to the current matmul). See more info in the :ref:`NKI architecture guide `.

Out of memory (OOM) when capturing a profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If ``neuron-profile capture`` fails due to device out-of-memory (OOM), you can increase the available memory using the single-IO mode.

Single-IO creates one shared I/O buffer on the device equal to the size of the largest I/O tensor. All inputs and outputs then point to slices of this shared buffer instead of allocating separate tensors. This significantly lowers the device memory needed during capture, at the cost of producing incorrect outputs.

Example usage:

::

   neuron-profile capture --single-io -n file.neff -s profile.ntff

Important: with ``--single-io``, the profiled performance characteristics (e.g., timing, utilization, bandwidth) are representative, but the model outputs are intentionally not correct.
Use this option only to get accurate performance measurements when device memory is tight; do not use it for correctness/accuracy validation.

If you are able to make changes to your model itself to reduce memory usage, consider the following:

- Reduce the batch size
- Lower the numerical precision
- Reduce the number of layers

In some cases, a full device profile isn't necessary to understand performance at a high level. You can instead capture a system profile, which shows the overall model execution time and a runtime API trace across all workers, and does not require extra device memory. See :ref:`System Profiles overview `.

Troubleshooting
---------------

Outputting to Unsupported NumPy Data Type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When running ``neuron-profile capture --save-output-npy``, you may encounter an error if the output tensor uses a data type that NumPy doesn't natively support:

::

   failed to save output output_hbm to file: unsupported type for npy output: bfloat16

To work around this, use ``--save-output`` instead to save the output as raw binary, then convert it to the desired data type using NumPy and the ``ml_dtypes`` library. This preserves the precision of the output, since it is written to binary instead of being cast to a supported data type.

::

   # Capture with raw binary output
   neuron-profile capture --save-output -n file.neff

   # Convert from raw binary to bfloat16
   import numpy as np
   import ml_dtypes
   output = np.fromfile('output0.npy', dtype=np.uint16)
   output = output.view(ml_dtypes.bfloat16)

InfluxDB not installed
~~~~~~~~~~~~~~~~~~~~~~

::

   $ neuron-profile view -n file.neff -s profile.ntff
   ERRO[0001] To install influxdb, go to https://portal.influxdata.com/downloads/ and follow the instructions there
   influxdb not setup correctly: exec: "influx": executable file not found in $PATH

::

   $ neuron-profile view -n file.neff -s profile.ntff
   ERRO[0000] influxdb token not setup correctly: exit status 1
   Try executing "systemctl start influxdb" and "influx setup"

Running ``neuron-profile view`` without InfluxDB installed will result in an error and a pointer to the InfluxDB installation instructions. Please follow the provided instructions and retry.

Too many open files
~~~~~~~~~~~~~~~~~~~

::

   influxdb2client E! Write error: internal error: unexpected error writing points to database: [shard 10677] open /home/ubuntu/.influxdbv2/engine/data/7caae65aaa48380d/autogen/10677/index/0/MANIFEST: too many open files

InfluxDB will encounter "too many open files" and out-of-memory errors after a few hundred buckets have been created. Two ways to solve this are to delete unused buckets or to increase the system file descriptor limit.

To increase the file descriptor limit, add the following lines to ``/etc/security/limits.d/efa.conf`` and ``/etc/security/limits.conf``:

::

   * soft nofile 1048576
   * hard nofile 1048576

Add the following lines to ``/etc/sysctl.conf``:

::

   fs.file-max = 197341270
   vm.max_map_count=1048576

Commit the changes by running ``sudo sysctl -p``.

.. |neuron-profile-web-timeline| image:: /images/neuron-profile-web-timeline_2-11.png
.. |neuron-profile-annotation-menu| image:: /images/neuron-profile-annotation-menu_2-21.png
.. |neuron-profile-view-settings| image:: /images/neuron-profile-view-settings_2-26.png
.. |neuron-profile-web-summaries| image:: /images/neuron-profile-web-summaries_2-21.png
.. |neuron-profile-tensor-reload-annotation| image:: /images/neuron-profile-tensor-reload-annotation.png
.. |neuron-profile-multiworker-timeline| image:: /images/neuron-profile-multiworker-timelime_2-16.png
.. |neuron-profile-web-timeline| image:: /images/neuron-profile-web-timeline_2-11.png
.. |neuron-profile-annotation-menu| image:: /images/neuron-profile-annotation-menu_2-21.png
.. |neuron-profile-view-settings| image:: /images/neuron-profile-view-settings_2-26.png
.. |neuron-profile-web-summaries| image:: /images/neuron-profile-web-summaries_2-21.png
.. |neuron-profile-tensor-reload-annotation| image:: /images/neuron-profile-tensor-reload-annotation.png
.. |neuron-profile-multiworker-timeline| image:: /images/neuron-profile-multiworker-timelime_2-16.png
.. |neuron-profile-cc-op-annotation| image:: /images/neuron-profile-cc-op-annotation.png
.. |neuron-profile-cc-op-barrier| image:: /images/neuron-profile-cc-op-barrier.png
.. |neuron-profile-click-tooltip| image:: /images/neuron-profile-click-tooltip.png
.. |neuron-profile-oob-error| image:: /images/neuron-profile-oob-error.png
.. |neuron-profile-search-summary| image:: /images/neuron-profile-search-summary.png
.. |neuron-profiler-memory-tracker| image:: /images/neuron-profiler-memory-tracker.png
.. |neuron-profile-stack-trace-event-details| image:: /images/neuron-profile-stack-trace-event-details.png
.. |neuron-profile-stack-trace-viewer| image:: /images/neuron-profile-stack-trace-viewer.png
.. |neuron-profile-perfetto-device| image:: /images/neuron-profiler2-perfetto-device.png

"FATAL - Failed metadata query" when viewing the UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are SSH port forwarding the web UI from a remote machine to your local desktop,
you will need to forward both the web UI (3001) and the database (8086), like so:

::

    ssh -L 3001:localhost:3001 -L 8086:localhost:8086 remote_machine

Visual Artifacts when viewing profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some users have reported visual artifacts when viewing certain profiles in browsers
other than Chrome. If you encounter this issue, please try using Chrome. For more
details, refer to the GitHub issue:
https://github.com/aws-neuron/aws-neuron-sdk/issues/1033

================================================
FILE: tools/profiler/neuron-profiler-2-0-beta-user-guide.rst
================================================

.. _neuron-profiler-2-0-guide:

Neuron Profiler 2.0 (Beta) User Guide
=====================================

Overview
--------

Neuron Profiler 2.0 offers a user-friendly experience for capturing and analyzing
application performance through both high-level system profiles and detailed
device-level profiles. Users can profile their workloads using framework-specific APIs
within their application code or by setting an environment variable before execution.
This tool supports profiling for both single-node and distributed workloads,
integrating with environments such as ParallelCluster and EKS.

Once captured, profile results can be explored through multiple interfaces: the Neuron
Profiler UI, the open-source trace viewer `Perfetto `_, or by exporting to a
human-readable JSON format. This flexibility in data capture and visualization enables
users to gain comprehensive insights into their application's performance across
various scenarios and scales.

.. important::

    The Neuron Profiler will be replaced by the new Neuron Explorer in a future
    release. For more details and migration guidance, see :ref:`neuron-explorer-faq`.

.. note::

    Neuron Profiler 2.0 is a set of new features currently in beta that enhance and
    simplify the experience of capturing and viewing profiles. It is not a replacement
    of :ref:`Neuron Profiler `, which is the existing feature set specifically for
    capturing and viewing device profiles.

.. _system-profiles-overview:

Key benefits
~~~~~~~~~~~~

- End-to-end timing of model execution and a Neuron Runtime API trace across all
  workers, helping identify scheduling gaps, synchronization, and host/runtime
  overheads.
- No extra device memory usage by default, making system profiles ideal when device
  memory is limited or when only high-level insights are needed.
- Option to capture device profiles for individual models during your workload.
- Flexible capture and viewing: enable via environment variables or framework APIs;
  view in the Neuron Profiler UI, in Perfetto, or export as JSON.

Capturing profiles
------------------

Neuron Profiler 2.0 offers several flexible options for capturing profiles. Users can
either set the environment variable ``NEURON_RT_INSPECT_ENABLE`` or use the PyTorch or
JAX profiling APIs from their application code for fine-grained control over which
sections of their code are profiled. PyTorch and JAX users who prefer not to modify
their application code can still enable profiling by setting the environment variable
before running their application.

JAX User Experience
-------------------

JAX Setup
~~~~~~~~~

Follow the :ref:`JAX Setup ` instructions to install the required JAX Neuron Plugin
and the latest Neuron Driver, Runtime, and Tools packages.

JAX Profiler
~~~~~~~~~~~~

The JAX context-managed profiling API allows you to profile blocks of code. This will
capture a system profile including a Neuron Runtime API trace and a Python trace for
your application code in the captured block. It will also capture device profiles for
any compiled graphs (NEFFs) executed on NeuronCores within this block.

To use the profiler, import the ``jax`` package.

.. code-block:: python

    import jax

Profiling is enabled for all code enclosed in the context when using
``with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):``

.. note::

    It is important to pass the output directory
    ``os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]`` to ``with jax.profiler.trace`` and
    to run ``export NEURON_RT_INSPECT_OUTPUT_DIR=`` before enabling profiling. This
    ensures all captured profile data is saved to the correct output directory.

Custom Annotations in JAX
~~~~~~~~~~~~~~~~~~~~~~~~~

To add custom annotations to blocks of code in your profile, you can use
``jax.profiler.TraceAnnotation``. Annotation names can be created at runtime, such as
in the :ref:`example here ` using
``with jax.profiler.TraceAnnotation("my_label"+str(i)):``. For more information on
TraceAnnotations, see the official `JAX documentation `_.

JAX Profiling using environment variable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the ``jax.profiler`` context manager, you can enable profiling for
your entire application using an environment variable. This is desirable if you want
to capture a profile without modifying your application code. To enable profiling, set
the environment variables ``NEURON_RT_INSPECT_ENABLE=1`` and
``NEURON_RT_INSPECT_OUTPUT_DIR=./output`` before running your application. For
example:

.. code-block:: shell

    # make sure to remove the call to with jax.profiler.trace from the python script
    NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python jax_script.py

When using the ``NEURON_RT_INSPECT_ENABLE`` environment variable instead of
``jax.profiler``, system profiles will not contain a framework and application code
trace, only the Neuron Runtime API trace.

Do not set the ``NEURON_RT_INSPECT_ENABLE`` environment variable and use
``jax.profiler`` within your application code at the same time. Use one or the other.

For more profiling options that can be set through environment variables, see the
section :ref:`Profile Capture Environment Variables `.

.. _neuron-profile-full-jax-example:

Full JAX Example
~~~~~~~~~~~~~~~~

Create a file ``jax_script.py`` which performs repeated matrix multiplications
distributed across Neuron devices.
.. code-block:: python

    from functools import partial
    import os
    from time import sleep

    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
    from jax.experimental.shard_map import shard_map

    os.environ["XLA_FLAGS"] = "--xla_dump_hlo_snapshots --xla_dump_to=./dump"
    jax.config.update("jax_default_prng_impl", "rbg")

    mesh = Mesh(jax.devices(), ('i',))

    def device_put(x, pspec):
        return jax.device_put(x, NamedSharding(mesh, pspec))

    lhs_spec = P('i', None)
    lhs = device_put(jax.random.normal(jax.random.key(0), (128, 128)), lhs_spec)

    rhs_spec = P('i', None)
    rhs = device_put(jax.random.normal(jax.random.key(1), (128, 16)), rhs_spec)

    @jax.jit
    @partial(shard_map, mesh=mesh, in_specs=(lhs_spec, rhs_spec), out_specs=rhs_spec)
    def matmul_allgather(lhs_block, rhs_block):
        rhs = jax.lax.all_gather(rhs_block, 'i', tiled=True)
        return lhs_block @ rhs

    with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
        out = matmul_allgather(lhs, rhs)
        for i in range(10):
            with jax.profiler.TraceAnnotation("my_label" + str(i)):
                out = matmul_allgather(lhs, rhs)
                sleep(0.001)

    expected = lhs @ rhs
    with jax.default_device(jax.devices('cpu')[0]):
        equal = jnp.allclose(jax.device_get(out), jax.device_get(expected), atol=1e-3, rtol=1e-3)
    print("Tensors are the same") if equal else print("Tensors are different")

Set your profile output directory and run the script:

.. code-block:: shell

    export NEURON_RT_INSPECT_OUTPUT_DIR=./output
    python jax_script.py

PyTorch User Experience
-----------------------

PyTorch Setup
~~~~~~~~~~~~~

Follow the :ref:`PyTorch Setup ` instructions to install the required PyTorch Neuron
packages as well as the latest Neuron Driver, Runtime, and Tools.

PyTorch Profiler
~~~~~~~~~~~~~~~~

The PyTorch context-managed profiling API allows you to profile blocks of code. This
will capture a system profile including a Neuron Runtime API trace and a Python trace
for your application code in the captured block. It will also capture device profiles
for any compiled graphs executed on NeuronCores within this block.

To use the profiler, import it in your application:

.. code-block:: python

    from torch_neuronx.experimental import profiler

Then profile a block of code using:

.. code-block:: python

    with torch_neuronx.experimental.profiler.profile(
            port=9012,
            profile_type='system',
            target='neuron_profile_perfetto',
            output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
            ms_duration=30000) as profiler:

After modifying your code to call the profiler, run your application as you normally
would, but set the environment variable ``NEURON_RT_INSPECT_OUTPUT_DIR`` to specify
the output directory.

.. code-block:: shell

    NEURON_RT_INSPECT_OUTPUT_DIR=./output python application.py

.. note::

    It is essential to set ``output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR']``
    when starting the profiler from your application code. This ensures that all
    profile data sources dump to the same output directory.

PyTorch Profiling using Environment Variable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the ``torch_neuronx.experimental.profiler.profile`` context manager,
you can enable profiling for your entire application using an environment variable.
This is desirable if you want to capture a profile without modifying your application
code. To enable profiling, set the environment variables
``NEURON_RT_INSPECT_ENABLE=1`` and ``NEURON_RT_INSPECT_OUTPUT_DIR=./output`` before
running your application. For example:
.. code-block:: shell

    # make sure to remove the call to with torch_neuronx.experimental.profiler.profile from the python script
    NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python pytorch_script.py

When using the ``NEURON_RT_INSPECT_ENABLE`` environment variable instead of
``torch_neuronx.experimental.profiler.profile``, system profiles will not contain a
framework and application code trace, only the Neuron Runtime API trace.

Do not set the ``NEURON_RT_INSPECT_ENABLE`` environment variable and use
``torch_neuronx.experimental.profiler.profile`` within your application code at the
same time. Use one or the other.

For more profiling options that can be set through environment variables, see the
section :ref:`Profile Capture Environment Variables `.

Full PyTorch Example
~~~~~~~~~~~~~~~~~~~~

Create a file ``train_torchrun_context.py`` with the following contents:

.. code-block:: python

    import os
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # XLA imports
    import torch_xla
    import torch_xla.core.xla_model as xm
    import torch_xla.debug.profiler as xp
    import torch_neuronx
    from torch_neuronx.experimental import profiler

    os.environ["NEURON_CC_FLAGS"] = "--cache_dir=./compiler_cache"

    # Global constants
    EPOCHS = 2

    # Declare 3-layer MLP Model
    class MLP(nn.Module):
        def __init__(self, input_size=10, output_size=2, layers=[5, 5]):
            super(MLP, self).__init__()
            self.fc1 = nn.Linear(input_size, layers[0])
            self.fc2 = nn.Linear(layers[0], layers[1])
            self.fc3 = nn.Linear(layers[1], output_size)

        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return F.log_softmax(x, dim=1)

    def main():
        # Fix the random number generator seeds for reproducibility
        torch.manual_seed(0)

        # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance)
        device = xm.xla_device()

        # Start the profiler context-manager
        with torch_neuronx.experimental.profiler.profile(
                port=9012,
                profile_type='system',
                target='neuron_profile_perfetto',
                output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
                ms_duration=30000) as profiler:

            # IMPORTANT: the model has to be transferred to XLA within
            # the context manager, otherwise profiling won't work
            model = MLP().to(device)
            optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
            loss_fn = torch.nn.NLLLoss()

            # start training loop
            print('----------Training ---------------')
            model.train()
            for epoch in range(EPOCHS):
                optimizer.zero_grad()
                train_x = torch.randn(1, 10).to(device)
                train_label = torch.tensor([1]).to(device)

                # forward
                loss = loss_fn(model(train_x), train_label)
                # back
                loss.backward()
                optimizer.step()

                # XLA: collect ops and run them in XLA runtime
                xm.mark_step()

        print('----------End Training ---------------')

    if __name__ == '__main__':
        main()

Run this workload with the following command:

.. code-block:: shell

    NEURON_RT_INSPECT_OUTPUT_DIR="output" python train_torchrun_context.py

.. _neuron-profiler-non-framework-user-experience:

Non-framework Specific User Experience
--------------------------------------

You can also control profiling with environment variables. This is useful when you
can't easily change your application code, such as when running an executable which
calls the Neuron Runtime, or in a containerized environment where the application code
is built into the container image.
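For instance, the following sketch enables both profile types for an arbitrary
launcher, using only the variables documented in the next section;
``./run_my_workload.sh`` is a hypothetical placeholder for whatever starts your Neuron
workload.

.. code-block:: shell

    export NEURON_RT_INSPECT_ENABLE=1          # turn profiling on
    export NEURON_RT_INSPECT_SYSTEM_PROFILE=1  # system trace (on by default)
    export NEURON_RT_INSPECT_DEVICE_PROFILE=1  # also capture device profiles
    export NEURON_RT_INSPECT_OUTPUT_DIR=./output
    ./run_my_workload.sh                       # placeholder launcher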
.. _neuron-profiler-capture-environment-variables:

Profile Capture Environment Variables
-------------------------------------

.. _core-control-variables:

Core control variables
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
    :widths: auto
    :header-rows: 1
    :align: left

    * - Variable
      - Description
      - Default behavior
    * - ``NEURON_RT_INSPECT_ENABLE``
      - Set to ``1`` to enable profiling
      - Enables system profiling and disables device profiling. To control which
        profile types are captured, see :ref:`Profile type selection `
    * - ``NEURON_RT_INSPECT_OUTPUT_DIR``
      - Directory for profile data output
      - The default directory for captured profile data is ``./output``

.. _profile-type-selection:

Profile type selection
~~~~~~~~~~~~~~~~~~~~~~

.. note::

    When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``,
    ``NEURON_RT_INSPECT_SYSTEM_PROFILE`` is enabled by default (set to ``1``) and
    ``NEURON_RT_INSPECT_DEVICE_PROFILE`` is disabled by default (set to ``0``).

When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``, two different profile types are
available:

.. list-table::
    :widths: auto
    :header-rows: 1
    :align: left

    * - Variable
      - Profile type
      - Description
      - Enable capture
      - Disable capture
    * - ``NEURON_RT_INSPECT_SYSTEM_PROFILE``
      - System-level
      - Captures runtime system events and operations
      - Set to ``1``
      - Set to ``0``
    * - ``NEURON_RT_INSPECT_DEVICE_PROFILE``
      - Device-level
      - Captures detailed NeuronCore hardware metrics
      - Set to ``1``
      - Set to ``0``

.. note::

    These variables have no effect if ``NEURON_RT_INSPECT_ENABLE`` is not set to ``1``.

.. _advanced-config-vars:

Advanced configuration
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
    :widths: auto
    :header-rows: 1
    :align: left

    * - Variable
      - Profile type
      - Description
      - Default behavior
    * - ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC``
      - System-level
      - Maximum trace events per NeuronCore before the oldest events are overwritten
      - 1,000,000

.. note::

    Increasing the event limit will consume more host memory.

Example Capturing Profile of Application Using Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the PyTorch or JAX profilers, you can profile your Python application
(or any application calling the Neuron Runtime API) using environment variables.

.. code-block:: shell

    NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python app.py

See :ref:`Profile Capture Environment Variables ` for other profiling options that can
be set via environment variable.

Example Capturing Profile of nccom-test Using Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profiling can be enabled using environment variables. For simplicity, a quick way to
generate a Neuron workload is to use :ref:`nccom-test `, a benchmarking tool which is
already available with the Neuron AMI.

.. code-block:: shell

    export NEURON_RT_INSPECT_ENABLE=1
    export NEURON_RT_INSPECT_OUTPUT_DIR=./output
    nccom-test allr allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512

.. note::

    If you have problems with nccom-test, add the ``--debug`` flag. If using a
    trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.

To understand the profiling output, see the section :ref:`Inspect Output `.

CLI reference for System Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to controlling profiling with environment variables, you can use the
``neuron-profile inspect`` command line interface for profiling applications. This
provides the same functionality as the environment variables, but helps you avoid
typos and invalid arguments, and provides a useful ``--help`` option to explain the
available options.

.. code-block:: shell

    Usage: neuron-profile [OPTIONS] inspect [inspect-OPTIONS] [userscript...]

    Application Options:
      -v, --version                 Show version and exit

    Help Options:
      -h, --help                    Show this help message

    [inspect command options]
      -o, --output-dir=             Output directory for the captured profile data, including system and device profiles (default: ./output)
      -n, --num-trace-events=       Maximum number of trace events to capture when profiling. Once hitting this limit, no new events are recorded
      --capture-system-profiles    Disable capture of system profile data. Can reduce output size.
      --capture-device-profiles    Disable capture of device profile data. Can reduce output size.

    [inspect command arguments]
      userscript:                   Run command/script that launches a Neuron workload. E.g. 'python app.py' or './runscript.sh'

Example of using System Profiles CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can provide any script that generates a Neuron workload (for example, a PyTorch
script) to the System Profiles CLI. For simplicity, a quick way to generate a Neuron
workload is to use ``nccom-test``, a benchmarking tool which is already available with
the Neuron AMI and the ``aws-neuronx-tools`` package.

.. code-block:: shell

    ubuntu@ip-172-31-63-210:~$ neuron-profile inspect -o inspect-output-nccom-test nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512
    INFO[0000] Running command "nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512" with profiling enabled
        size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
         524288          131072    fp32           24.15          21.71          21.03
    Avg bus bandwidth:    21.0339GB/s

.. note::

    If you have problems with nccom-test, add the ``--debug`` flag. If using a
    trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.

.. _neuron-profiler-inspect-output:

``neuron-profile inspect`` Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above command traces the Neuron workload execution and writes the results to the
``inspect-output-nccom-test`` directory. The output directory contains a single NEFF
file and a device profile (NTFF) for each NeuronCore which executed that NEFF. You
will also see ``ntrace.pb`` and ``trace_info.pb`` files storing the system profile
data. The output will look like the following:

.. code-block:: shell

    ubuntu@ip-172-31-63-210:~$ tree inspect-output-nccom-test
    inspect-output-nccom-test
    └── i-012590440bb9fd263_pid_98399
        ├── 14382885777943380728_instid_0_vnc_0.ntff
        ├── 14382885777943380728_instid_0_vnc_1.ntff
        ├── 14382885777943380728_instid_0_vnc_10.ntff
        ├── 14382885777943380728_instid_0_vnc_11.ntff
        ...
        ├── 14382885777943380728_instid_0_vnc_8.ntff
        ├── 14382885777943380728_instid_0_vnc_9.ntff
        ├── cpu_util.pb
        ├── host_mem.pb
        ├── neff_14382885777943380728.neff
        ├── ntrace.pb
        └── trace_info.pb

    2 directories, 74 files

To view a summary of the captured profile data, run the command:

.. code-block:: shell

    neuron-profile view -d inspect-output-nccom-test --output-format summary-text

EKS User Experience
-------------------

Capturing a profile on EKS is most easily done by setting environment variables, as
described in the section :ref:`Non-framework specific User Experience `. By using
environment variables, users do not need to change application code in their container
image or modify their run commands.

Update the deployment YAML to include the ``NEURON_RT_INSPECT_ENABLE`` and
``NEURON_RT_INSPECT_OUTPUT_DIR`` environment variables. For distributed workloads,
it's important that ``NEURON_RT_INSPECT_OUTPUT_DIR`` points to a directory on a shared
volume which all workers have access to.
.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: trn1-mlp
    spec:
      restartPolicy: Never
      schedulerName: default-scheduler
      nodeSelector:
        beta.kubernetes.io/instance-type: trn1.32xlarge
      containers:
        - name: trn1-mlp
          env:
            - name: NEURON_RT_INSPECT_ENABLE
              value: "1"
            - name: NEURON_RT_INSPECT_OUTPUT_DIR
              value: "/shared/output"
          command: ['torchrun']
          args:
            - '--nnodes=1'
            - '--nproc_per_node=32'
            - 'train_torchrun.py'
          image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:mlp
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              aws.amazon.com/neuron: 16

.. note::

    EKS users running PyTorch and JAX applications are still free to change their
    application code and use the PyTorch or JAX Python profiling APIs if they want
    finer-grained control over profiling. However, using the environment variables
    conveniently allows profiling without modifying the container image or application
    code.

Processing and Viewing Profiles
-------------------------------

Users have three output options for interacting with their captured profiles:

* Neuron Profiler UI - Neuron's custom UI, which allows easily drilling down to
  detailed device profiles from high-level system profiles
* Perfetto - allows sharing profiles as a single file and viewing your profiles in the
  Perfetto UI at https://ui.perfetto.dev/
* JSON - human-readable text output that enables simple scripting

Neuron Profiler UI
~~~~~~~~~~~~~~~~~~

To view a profile in the Neuron Profiler UI, run the following command to process the
profile and launch the UI:

.. code-block:: shell

    neuron-profile view -d ./output

To view profiles with the Neuron Profiler UI running locally, you will need to have
InfluxDB installed on your system. To install and set up InfluxDB, follow the
:ref:`directions in the official Neuron Profile documentation `.

Neuron Profiler System Profile UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The system profile timeline shows a trace of Neuron Runtime API calls, ML framework
function calls, CPU utilization, and memory usage on each of the instances in your
workload.

The Neuron Runtime API trace is grouped by NeuronCore index and EC2 instance ID. For
example, all events in the row labeled ``nrt-nc-003-i-0f207fb2a99bd2d08`` are
associated with NeuronCore 3 and instance i-0f207fb2a99bd2d08. Framework function
traces are grouped by thread ID and EC2 instance ID. For example, all events in the
row ``framework-3266405268-i-0f207fb2a99bd2d08`` are framework or application function
calls made on thread 3266405268 running on instance i-0f207fb2a99bd2d08.

|neuron-profiler2-annotate-system-ui|

Clicking on a trace event in the timeline shows an "Event attributes" view with a list
of attributes associated with that event. For example, clicking on an ``nrt_execute``
event (the Neuron Runtime API call for executing a compiled model on a NeuronCore)
will show attributes such as the flop count (the number of floating point operations
for a single execution of the model), the model name, and the NeuronCore index and EC2
instance ID associated with the function call.

|neuron-profiler2-attributes-window|

Neuron Profiler 2.0 allows users to drill down from a system timeline to a device
profile timeline in order to see a detailed view of hardware activity during the
execution of a graph. To do this, select an ``nrt_execute`` event in the timeline,
and in the "Event attributes" view select the "Open device profile" button under the
Model Name attribute. This will open a new window with a device profile.
For help understanding a device profile, see the documentation section "Understanding
a Neuron Profile".

|neuron-profiler2-drilldown-device|

To see a list of all device profiles that were captured during your workload, press
the "Device Profiles" button at the bottom of the timeline. From this list you can see
all unique compiled graphs (NEFFs) that were executed on NeuronCores during your
workload. For each graph there is a link to a device profile that will show a detailed
view of hardware activity on the NeuronCore during execution of this graph.

|neuron-profiler2-device-profile-list|

Viewing Profiles with Perfetto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Perfetto is an open-source trace analysis toolkit with a powerful UI for visualizing
and analyzing trace data. Users of the Neuron Profiler have the option of viewing
their profiles in the Perfetto UI. The ``--output-format perfetto`` option writes
processed data to Perfetto's native protobuf-based tracing format, which can be
visualized in the Perfetto UI at https://ui.perfetto.dev/. Example:

.. code-block:: shell

    neuron-profile view -d ./output --output-format perfetto

This will generate a ``system_profile.pftrace`` file for the system profile and a
``device_profile_model_.pftrace`` file for each unique compiled model that was
executed on a Neuron Device. To view the system profile, go to
https://ui.perfetto.dev/ and open the ``system_profile.pftrace`` file.

.. note::

    When loading trace files in the Perfetto UI, your data is processed locally and
    not uploaded to Perfetto's servers.

|neuron-profiler2-perfetto-timeline|

To view a device profile, go to https://ui.perfetto.dev/ and open the
``device_profile_model_.pftrace`` file. This will show a detailed view of hardware
activity on the NeuronCore during execution of this graph.

|neuron-profiler2-perfetto-device-timeline|

.. note::

    Your browser may run out of memory when viewing ``*.pftrace`` (Perfetto trace)
    files that are more than a few hundred MB. See the section :ref:`Viewing Large
    Profiles in Perfetto ` for directions on how to view large traces using the trace
    processor.
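As a quick local sketch of that approach (this assumes Perfetto's ``trace_processor``
binary is installed on your machine; it is a Perfetto tool, not part of the Neuron
SDK), you can serve the trace to the UI instead of loading the whole file into the
browser:

.. code-block:: shell

    # Serve the trace over a local HTTP endpoint; ui.perfetto.dev can then
    # attach to the local trace_processor instance instead of loading the
    # .pftrace file into browser memory.
    trace_processor --httpd system_profile.pftrace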
Perfetto Output View Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When outputting to Perfetto, it is possible to group your traces by different
attributes. This is useful for larger profiles involving many NeuronCores and
instances. The following options are available:

.. list-table:: Perfetto output view options
    :header-rows: 1
    :widths: 30 70

    * - CLI option
      - Description
    * - ``--system-trace-primary-group``
      - First-order grouping of trace events (maps to a Perfetto process / process
        group of rows). Provide a comma-delimited list of field names. Allowed fields:
        ``instance_id``, ``thread_id``, ``lnc_idx``, ``process_id``. Default:
        ``instance_id,process_id``.
    * - ``--system-trace-secondary-group``
      - Second-order grouping of trace events (maps to a Perfetto thread / single
        row). Provide a comma-delimited list of field names. Allowed fields:
        ``instance_id``, ``worker_gid``, ``thread_id``, ``lnc_idx``, ``process_id``.
        Default: ``worker_gid,lnc_idx,thread_id``.

For example, the following profile uses ``neuron-profile view
--output-format=perfetto --system-trace-primary-group=instance_id,process_id
--system-trace-secondary-group=lnc_idx,thread_id`` to group the system profile first
by unique combinations of ``instance_id`` and ``process_id``; within each of those
groups there are rows of events with unique combinations of ``lnc_idx`` and
``thread_id``.

|neuron-profiler2-perfetto-grouping|

Grouping By Global Worker ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Perfetto traces are grouped by ``worker_gid``, which is a unique global
identifier for each NeuronCore across all instances in a distributed workload. When
clicking on an event in the trace, you will see fields for both ``lnc_idx`` (the local
NeuronCore index in that process) and ``worker_gid`` (the global NeuronCore index
across all instances). It is possible for ``lnc_idx`` to be the same for different
processes on the same instance or across different instances in a distributed
workload; ``worker_gid``, however, is unique for each NeuronCore across all instances.
The image below shows how the naming of tracks (rows) in the Perfetto UI correlates to
both ``lnc_idx`` and ``worker_gid``.

|neuron-profiler2-perfetto-gid|

Generating JSON Output From Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``--output-format json`` option writes processed profile data to human-readable
JSON that can be used for scripting and manual inspection.

.. code-block:: shell

    neuron-profile view -d ./output --output-format json

This will generate a ``system_profile.json`` file containing the system profile data
and a ``device_profile_model_.json`` file for each unique compiled model that was
executed on a Neuron Device.

The ``system_profile.json`` file contains the following data types:

* ``trace_events``: Neuron Runtime API trace events and framework/application trace
  events containing timestamps, durations, names, and the EC2 instance ID to
  differentiate between events from different compute nodes in a distributed workload.

.. code-block:: json

    {
        "Neuron_Runtime_API_Event": {
            "duration": 27094,
            "group": "nrt-nc-000",
            "id": 1,
            "instance_id": "i-0f207fb2a99bd2d08",
            "lnc_idx": "0",
            "name": "nrt_tensor_write",
            "parent_id": 0,
            "process_id": "1627711",
            "size": "4",
            "tensor_id": "4900392441224765051",
            "tensor_name": "_unknown_",
            "thread_id": 1627711,
            "timestamp": 1729888371056597613,
            "type": 11
        },
        "Framework_Event": {
            "duration": 3758079,
            "group": "framework-80375131",
            "instance_id": "i-0f207fb2a99bd2d08",
            "name": "PjitFunction(matmul_allgather)",
            "process_id": "701",
            "thread_id": 80375131,
            "timestamp": 1729888382798557372,
            "type": 99999
        }
    }

* ``mem_usage``: sampled host memory usage

.. code-block:: json

    {
        "duration": 1,
        "instance_id": "i-0f207fb2a99bd2d08",
        "percent_usage": 9.728179797845964,
        "timestamp": 1729888369286687792,
        "usage": 51805806592
    }

* ``cpu_util``: sampled CPU utilization. Results are provided per core and per EC2
  instance involved in a distributed workload.

.. code-block:: json

    {
        "cpu_id": "47",
        "duration": 1,
        "instance_id": "i-0f207fb2a99bd2d08",
        "timestamp": 1729888371287337243,
        "util": 2.3255813
    }

Processing only system or device profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To reduce processing times, it is possible to skip processing of system or device
profiles. Sometimes users may only be interested in one, or want to start with a
limited set of profiling data before exploring the full profile.

To skip processing of device profiles, use the ``--ignore-device-profile`` option. To
skip processing of system profiles, use the ``--ignore-system-profile`` option. These
options can be used with the ``--output-format`` values ``db`` (default),
``perfetto``, or ``json``. For example:

.. code-block:: shell

    neuron-profile view -d ./output --ignore-device-profile --output-format perfetto
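Because the JSON output described above is plain text, it lends itself to quick ad-hoc
analysis. The following is a minimal sketch, not a supported tool; it assumes a
top-level ``trace_events`` collection and the event fields shown in the excerpts above
(the exact layout may vary between releases):

.. code-block:: python

    import json

    # Sum the reported durations of all nrt_execute events in a system profile.
    # Assumes system_profile.json exposes a "trace_events" list whose entries
    # carry "name" and "duration" fields, as in the excerpts above.
    with open("system_profile.json") as f:
        profile = json.load(f)

    total = sum(
        event.get("duration", 0)
        for event in profile.get("trace_events", [])
        if event.get("name") == "nrt_execute"
    )
    print("Total nrt_execute duration (profile time units):", total)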
.. _neuron-profiler-filtering-system-profiles:

Filtering System Profiles
-------------------------

This guide explains how to filter system trace events to optimize memory usage, reduce
output size, and speed up trace processing.

**Capture-time filtering** reduces memory usage and trace file size by only collecting
specific events, but filtered data cannot be recovered later. **Processing-time
filtering** preserves the complete trace and allows flexible analysis with different
filters, but requires more memory and storage during capture.

Capture-Time Filtering
~~~~~~~~~~~~~~~~~~~~~~

Configure filters before trace capture using environment variables or API functions.
You can use NeuronCore filters to only capture events for specific NeuronCores (for
example, only events associated with NeuronCore 0, or all the NeuronCores on a
specific NeuronDevice). You can use event type filters to only capture specific events
(for example, model execute or collectives events). It is possible to combine both
NeuronCore and event type filters.

Filtering by NeuronCore
^^^^^^^^^^^^^^^^^^^^^^^

If capture is enabled for a NeuronCore, then a ring buffer will be allocated in host
memory for storing that core's events. Thus, filtering by NeuronCore decreases host
memory usage during capture.

Default Behavior
""""""""""""""""

By default, all visible NeuronCores are enabled for capture.

Using Environment Variables
"""""""""""""""""""""""""""

.. code-block:: shell

    # Filter to capture events only from NeuronCore 0
    export NEURON_RT_INSPECT_EVENT_FILTER_NC=0

    # Filter to capture events from NeuronCores 0, 2, and 4
    export NEURON_RT_INSPECT_EVENT_FILTER_NC=0,2,4

    # Filter to capture events from a range of NeuronCores (0 through 3)
    export NEURON_RT_INSPECT_EVENT_FILTER_NC=0-3

    # Reset to default behavior
    unset NEURON_RT_INSPECT_EVENT_FILTER_NC  # Back to capturing all visible cores

Using API Functions
"""""""""""""""""""

.. code-block:: c

    #include <nrt/nrt.h>

    // Allocate and configure trace options
    nrt_sys_trace_config_t *config;
    nrt_sys_trace_config_allocate(&config);
    nrt_sys_trace_config_set_defaults(config);

    // Enable capture only for specific NeuronCores
    // Disable all cores since by default they are all enabled
    int num_cores = 128;
    for (int i = 0; i < num_cores; i++) {
        // NOTE: this per-core toggle name is assumed by analogy with the
        // event-type API shown later in this section
        nrt_sys_trace_config_set_capture_enabled_for_nc(config, i, false);
    }
    // Re-enable the cores you want to capture (core 0 here; same assumed API)
    nrt_sys_trace_config_set_capture_enabled_for_nc(config, 0, true);

Filtering by Event Type
^^^^^^^^^^^^^^^^^^^^^^^

You can discover the available event types programmatically with
``nrt_sys_trace_get_event_types``:

.. code-block:: c

    // Get all available event types
    const char **event_types = NULL;
    size_t count = 0;
    NRT_STATUS status = nrt_sys_trace_get_event_types(&event_types, &count);
    if (status == NRT_SUCCESS) {
        printf("Available event types:\n");
        for (size_t i = 0; i < count; ++i) {
            printf("  %s\n", event_types[i]);
        }
        // Free the event types array
        for (size_t i = 0; i < count; ++i) {
            free((void*)event_types[i]);
        }
        free((void*)event_types);
    }

Using Environment Variables
"""""""""""""""""""""""""""

The ``NEURON_RT_INSPECT_EVENT_FILTER_TYPE`` environment variable supports:

* **Default**: If not set, all event types are captured
* **Specific event types**: Use exact event names from ``nrt_sys_trace_get_event_types()``
* **Event categories**: Use ``hardware`` or ``software`` to filter by category
* **Exclusion**: Use the ``^`` prefix to exclude specific events from a category
.. code-block:: shell

    # Filter to capture only specific event types
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=nrt_load,nrt_execute,nc_exec_running

    # Filter to capture all hardware events
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware

    # Filter to capture all software events
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software

    # Filter to capture all hardware events EXCEPT cc_running
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,^cc_running

    # Filter to capture all software events EXCEPT nrt_load
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software,^nrt_load

    # Mix categories and specific events
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,nrt_tensor_write,nrt_tensor_read

    # Reset to default behavior
    unset NEURON_RT_INSPECT_EVENT_FILTER_TYPE  # Back to capturing all event types

The ``hardware`` group contains events that are executed on the NeuronCore. These are
``nc_exec_running``, ``cc_running``, ``cc_exec_barrier``, ``numerical_err``,
``nrt_model_switch``, ``timestamp_sync_point``, and ``hw_notify``. The ``software``
group contains all other events.

Using API Functions
"""""""""""""""""""

Use the ``nrt_sys_trace_config_set_capture_enabled_for_event_type`` API to filter by
event type.

.. code-block:: c

    #include <nrt/nrt.h>

    // Configure trace options
    nrt_sys_trace_config_t *config;
    nrt_sys_trace_config_allocate(&config);
    nrt_sys_trace_config_set_defaults(config);

    // By default, all event types are enabled
    // Disable specific event types (others remain enabled)
    nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "device_exec", false);

    // Or disable all first, then enable only specific ones
    const char **all_event_types = NULL;
    size_t all_count = 0;
    nrt_sys_trace_get_event_types(&all_event_types, &all_count);

    // Disable all event types first
    for (size_t i = 0; i < all_count; ++i) {
        nrt_sys_trace_config_set_capture_enabled_for_event_type(config, all_event_types[i], false);
    }

    // Enable only specific event types
    nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "model_load", true);
    nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "nrt_execute", true);

    // Verify which event types are enabled
    const char **enabled_types = NULL;
    size_t enabled_count = 0;
    nrt_sys_trace_config_get_enabled_event_types(config, &enabled_types, &enabled_count);
    printf("Enabled event types: %zu\n", enabled_count);
    for (size_t i = 0; i < enabled_count; ++i) {
        printf("  %s\n", enabled_types[i]);
    }

    // Clean up memory (caller is responsible)
    for (size_t i = 0; i < enabled_count; ++i) {
        free((void*)enabled_types[i]);
    }
    free((void*)enabled_types);
    for (size_t i = 0; i < all_count; ++i) {
        free((void*)all_event_types[i]);
    }
    free((void*)all_event_types);

    // Start tracing
    nrt_sys_trace_start(config);

    // Your application code here...

    // Cleanup
    nrt_sys_trace_stop();
    nrt_sys_trace_config_free(config);
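As noted above, NeuronCore and event-type filters can be combined. A minimal sketch
using only the environment variables documented in this section (``app.py`` stands in
for your own workload):

.. code-block:: shell

    # Keep only hardware events (minus cc_running) from NeuronCores 0-3
    export NEURON_RT_INSPECT_ENABLE=1
    export NEURON_RT_INSPECT_OUTPUT_DIR=./output
    export NEURON_RT_INSPECT_EVENT_FILTER_NC=0-3
    export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,^cc_running
    python app.py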
.. _neuron-profile-system-timestamp-adjustment:

Adjusting Hardware Timestamps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hardware events executed on the NeuronCore use device-specific timestamps that are in
a different time domain than CPU timestamps. To enable accurate correlation between
hardware and software events in the JSON system trace output, the runtime
automatically adjusts hardware event timestamps to the CPU time domain using
synchronization point events.

How Timestamp Adjustment Works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

System trace events are generated from multiple independent time domains: the CPU host
and each ML accelerator device, each operating with its own clock. To align events
from different domains, the runtime performs software-based time synchronization after
event collection.

**Sync Point Events**: After each execution, a special ``timestamp_sync_point`` event
captures nearly simultaneous timestamps from both the host CPU (``cpu_timestamp_ns``)
and the device (``nc_timestamp_ns``). These sync events are used to adjust the
timestamps of hardware events to the CPU domain. The synchronization events are
included in the returned event trace and serve as reference points for timestamp
adjustment, so users can see the sync point used for aligning hardware events in the
timeline.

**Adjustment Algorithm**: For each hardware event, the runtime:

- Uses the sync point with the matching ``exec_id`` for that NeuronCore
- Calculates the time difference between the hardware event and the sync point (in
  device time)
- Applies that same time difference to the sync point's CPU timestamp
- Formula: ``adjusted_timestamp = sync_cpu_timestamp + (event_device_timestamp - sync_device_timestamp)``

Illustration::

                 Sync_Point          HW_Event
                     │                   │
                     ▼                   ▼
    Device Time ─────●───────────────────●───>
                     |--------Δt-------->|
    CPU Time    ─────●───────────────────●───>
                     |--------Δt-------->|

    - sync_device_timestamp and sync_cpu_timestamp occur ~simultaneously,
      though their clocks differ
    - Calc Δt = event_device_timestamp - sync_device_timestamp
      (elapsed time since sync point on device)
    - Add Δt to sync_cpu_timestamp to get adjusted_timestamp

|neuron-profiler2-syncpoint-timeline|

**Hardware Events**: Hardware events that require timestamp adjustment include:

- ``nc_exec_running`` (NeuronCore execution start/stop)
- ``cc_running`` (collective communication execution)
- ``cc_exec_barrier`` (collective communication barriers)
- ``numerical_err`` (numerical errors)
- ``nc_model_switch`` (NeuronCore model switching)

Tips
^^^^

1. **Memory Optimization**: Use NeuronCore filtering to avoid allocating ring buffers
   for unused cores and decrease host memory usage. Use both event-type and NeuronCore
   filters to decrease output trace sizes.
2. **Event Type Discovery**: Use ``nrt_sys_trace_get_event_types()`` to discover
   available event types.
3. **Category Filtering**: Use the ``hardware``/``software`` categories for broad
   filtering.
4. **Exclusion Filtering**: Use the ``^`` prefix to exclude specific events from
   categories.
5. **Combine Filters**: Use both NeuronCore and event type filters together for
   maximum optimization.

Processing-Time Filtering
~~~~~~~~~~~~~~~~~~~~~~~~~

Apply filters when viewing or processing already captured profiles. This approach
allows you to analyze the same trace data in different ways without recapturing. The
filters can be used with any ``neuron-profile`` output format, including
``--output-format json`` and ``--output-format perfetto``.

Filtering by NeuronCore
^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-neuron-core`` option to process only events for
specific NeuronCores. The IDs are local to the instance, not global IDs. If the
``--system-trace-filter-neuron-core`` argument is not set, then events from all
NeuronCores will be included in the processed trace.

.. code-block:: shell

    # Filter by single neuron core
    neuron-profile view -d ./output --system-trace-filter-neuron-core "0" --output-format perfetto

    # Filter by multiple neuron cores
    neuron-profile view -d ./output --system-trace-filter-neuron-core "0,1,2,3" --output-format perfetto

Filtering by Event Type
^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-event-type`` option to process only specific trace
event types.
If the ``--system-trace-filter-event-type`` argument is not set, then all event types
will be included in the processed trace.

.. code-block:: shell

    # Filter by single event type
    neuron-profile view -d ./output --system-trace-filter-event-type "nrt_execute" --output-format perfetto

    # Filter by multiple event types
    neuron-profile view -d ./output --system-trace-filter-event-type "nrt_execute,nrt_load" --output-format perfetto

Filtering by Instance ID
^^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-instance-id`` option to process only events for
specific EC2 instances. If the ``--system-trace-filter-instance-id`` argument is not
set, then events from all instances will be included in the processed trace.

.. code-block:: shell

    # Filter by single instance
    neuron-profile view -d ./output --system-trace-filter-instance-id "i-abc123" --output-format perfetto

    # Filter by multiple instances (comma-separated)
    neuron-profile view -d ./output --system-trace-filter-instance-id "i-abc123,i-def456,i-ghi789" --output-format perfetto

Troubleshooting
---------------

Incomplete JAX Profiles
~~~~~~~~~~~~~~~~~~~~~~~

If your JAX profile has fewer events than expected or lacks the Runtime API trace,
check whether ``jax.profiler.stop_trace`` is being called inside a
``with jax.profiler.trace`` context block. This can prematurely stop tracing. Use
``jax.profiler.stop_trace`` only when profiling was started with
``jax.profiler.start_trace``, not when using the context-managed
``with jax.profiler.trace`` API.

Also, when using ``jax.profiler`` within your script, ensure that the environment
variable ``NEURON_RT_INSPECT_ENABLE`` is not set to ``1``. Additionally, ensure that
``NEURON_RT_INSPECT_OUTPUT_DIR`` is set to the correct output directory and that this
is the output directory passed to ``with jax.profiler.trace``.

Dropped Events in System Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When processing a system profile, you may see a warning indicating that some trace
events were dropped during capture.

.. code-block:: shell

    WARN[0000] Warning: 1001 trace events were dropped during capture (stored 530560 out of 531561 total events). Consider increasing buffer size, reducing trace duration, or filtering events.

This means that during capture the trace event buffers filled up and the oldest events
were overwritten. If you need to avoid dropping events for the full duration of your
workload, consider the following adjustments:

* Increase the buffer size by setting ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC``
  (see :ref:`Profile Capture Environment Variables `, and the sketch below). This will
  increase host memory usage.
* Apply capture-time filters (NeuronCores / event types); see
  :ref:`Filtering System Profiles `.
* Shorten the profiled region: limit the code span under the profiling context /
  runtime.
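A minimal sketch of the first adjustment (the value is illustrative; the variable and
its default of 1,000,000 events are documented above):

.. code-block:: shell

    # Double the default per-NeuronCore event buffer before re-running the workload
    export NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC=2000000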
.. |neuron-profiler2-annotate-system-ui| image:: /images/neuron-profiler2-annotate-system-ui.png
.. |neuron-profiler2-attributes-window| image:: /images/neuron-profiler2-attributes-window.png
.. |neuron-profiler2-device-profile-list| image:: /images/neuron-profiler2-device-profile-list.png
.. |neuron-profiler2-drilldown-device| image:: /images/neuron-profiler2-drilldown-device.png
.. |neuron-profiler2-perfetto-timeline| image:: /images/neuron-profiler2-perfetto-timeline.png
.. |neuron-profiler2-perfetto-device-timeline| image:: /images/neuron-profiler2-perfetto-device-timeline.png
.. |neuron-profiler2-perfetto-grouping| image:: /images/neuron-profiler2-perfetto-grouping.png
.. |neuron-profiler2-syncpoint-timeline| image:: /images/neuron-profiler2-syncpoint-timeline.png
.. |neuron-profiler2-perfetto-gid| image:: /images/neuron-profiler2-perfetto-gid.png

================================================
FILE: tools/tensorboard/getting-started-tensorboard-neuronx-plugin.rst
================================================

.. _neuronx-plugin-tensorboard:

NeuronX Plugin for TensorBoard (Trn1)
=====================================

.. contents:: Table of Contents
    :local:
    :depth: 2

Overview
--------

This guide is for developers who want to better understand how their model is executed
using the Neuron SDK, through TensorBoard.

The Neuron plugin for TensorBoard provides metrics on the performance of machine
learning tasks accelerated using the Neuron SDK. It is compatible with TensorBoard
versions 1.15 and higher, and provides visualizations and profiling results for graphs
executed on NeuronCores.

.. note::

    The following information is compatible with the Neuron SDK for Trn1. For a
    walkthrough on Inf1, please check out the guide :ref:`neuron-plugin-tensorboard`.

Enable profiling on Trn1
------------------------

.. note::

    Profiling is currently only supported with PyTorch Neuron (``torch-neuronx``).

Please refer to the following guides:

- PyTorch-Neuron - :ref:`torch-neuronx-profiling-with-tb`

Launch TensorBoard
------------------

In this step, we will process the Neuron profile data and launch TensorBoard.

1. Install the Neuron plugin for TensorBoard on your EC2 instance.

   .. code:: bash

      pip install tensorboard-plugin-neuronx --extra-index-url https://pip.repos.neuron.amazonaws.com

   .. note::

      If using TensorBoard >= 2.5, please use the ``--load_fast=false`` option when
      launching: ``tensorboard --logdir results --load_fast=false``

2. After you see the following message, TensorBoard is ready to use. By default,
   TensorBoard is launched at ``localhost:6006``.

   ::

      ...
      Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
      TensorBoard 2.4.1 at http://localhost:6006/ (Press CTRL+C to quit)

View results in TensorBoard
---------------------------

In this step, we will view the Neuron plugin for TensorBoard from a browser on your
local development machine.

1. Connect to the EC2 instance where TensorBoard is running while enabling port
   forwarding. In this example, we assume TensorBoard has been launched using the
   default address ``localhost:6006``.

   .. code:: bash

      # if Ubuntu-based AMI
      ssh -i <key file> ubuntu@<instance address> -L 6006:localhost:6006

      # if AL2-based AMI
      ssh -i <key file> ec2-user@<instance address> -L 6006:localhost:6006

2. In a browser, visit |tensorboard_address|.

3. In the top navigation bar, switch from ``Graphs`` to ``Neuron``. If it does not
   show up, please wait a while and refresh the page while the plugin loads. If the
   issue persists, check the ``Inactive`` dropdown list on the right and look for
   ``Neuron``.

   |image1|

4. If TensorBoard failed to find the generated logs, you will see the following
   message:

   |image2|

   In this case, please make sure the versions of the ``aws-neuronx-tools`` package
   and the Neuron framework package are from Neuron release 2.6 or newer.

Neuron Trace View
-----------------

|image3|

The trace view gives a high-level timeline of execution by aligning Neuron events,
such as Neuron Device execution, data transfers, and Collective Compute
synchronization (if applicable), with other events from the XLA profiler.
Use this view to better understand bottlenecks during the run, and potentially
experiment with how execution changes by moving the ``mark_step()`` call, which
executes the graph.

Neuron Operator View
--------------------

|image4|

The operator view can show timing information for both the framework operators and
the HLO operators, by selecting the ``operator-framework`` and ``operator-hlo`` tools
respectively. The pie charts show breakdowns of the time taken by device, as well as
per operator on a single device. The table below lists the operators and can be sorted
by clicking on the column headers. For fused operations, hover over the ``?`` to see
which operators are being executed.

For a quick glance at the most time-consuming operators, click the ``Time %`` column
in the table to sort by the relative time spent on this type of operation compared to
the rest of the model.

Neuron Operator Timeline View
-----------------------------

|image5|

The operator timeline view is a detailed look into a single execution with Neuron. A
high-level overview at the top breaks down the execution into categories, including
Neuron Runtime setup time, as well as NeuronCore compute engine and DMA engine
busyness. Activity on the compute and DMA engines is further categorized into compute,
control, and data transfer intervals, which are shown as separate processes, with each
showing a hierarchical view of the framework operators and their corresponding HLO
operation. Fused operations can be a result of compiler optimizations, or of
operations running in parallel on the device. Each bar can be clicked to show
information regarding which operators are overlapped.

This view can give better insight into how operators translate to Neuron, as well as
how certain Neuron compiler options may improve performance.

Troubleshooting
---------------

TensorBoard launch fails
~~~~~~~~~~~~~~~~~~~~~~~~

::

    ImportError: cannot import name 'Mapping' from 'collections'

This is an issue with Python 3.10 and a dependency of an old TensorBoard version. To
work around this error, please run ``pip install --upgrade tensorboard``. For more
information, see https://github.com/tensorflow/tensorboard/pull/5490.

.. |image1| image:: /images/Neuron_Profiler_Tensorboard_Dropdown.jpg
.. |image2| image:: /images/tb-plugin-img12.png
    :height: 2914
    :width: 5344
    :scale: 10%
.. |image3| image:: /images/Neuron_Profiler_Runtime_Trace_Original.jpg
.. |image4| image:: /images/Neuron_Profiler_T1_Op_Framework_View.png
.. |image5| image:: /images/TB_Operator_Timeline_2-10.png
.. |tensorboard_address| raw:: html

    <a href="http://localhost:6006">localhost:6006</a>

================================================
FILE: tools/tensorboard/index.rst
================================================

.. _tensorboard-neuron:

TensorBoard
===========

TensorBoard integration with AWS Neuron provides powerful visualization and debugging
capabilities for machine learning workloads. The Neuron TensorBoard plugins enable
developers to monitor training progress, analyze model performance, and debug
compilation issues through familiar TensorBoard interfaces.

.. toctree::
    :maxdepth: 1
    :hidden:

    TensorBoard for NeuronX </tools/tensorboard/getting-started-tensorboard-neuronx-plugin>

TensorBoard for Trn1
--------------------

.. grid:: 1
    :gutter: 3
    .. grid-item-card:: TensorBoard Plugin for NeuronX (Trn1)
        :link: /tools/tensorboard/getting-started-tensorboard-neuronx-plugin
        :link-type: doc
        :class-header: sd-bg-primary sd-text-white

        Comprehensive guide for using the TensorBoard Neuron plugin on Trn1 instances,
        including installation, configuration, and advanced visualization features.

    .. grid-item-card:: Profiling PyTorch NeuronX (``torch-neuronx``) with TensorBoard
        :link: /tools/tutorials/torch-neuronx-profiling-with-tb
        :link-type: doc
        :class-header: sd-bg-primary sd-text-white

        Step-by-step tutorial for monitoring PyTorch training progress on Trn1
        instances using TensorBoard scalars, metrics visualization, and performance
        tracking.

================================================
FILE: tools/third-party-solutions.rst
================================================

.. _third-party-tool-solutions:

Third-party solutions
=====================

AWS Neuron integrates with multiple third-party partner solutions that allow you to
run deep learning workloads on Amazon EC2 instances powered by AWS Trainium and AWS
Inferentia chips. The following list gives an overview of third-party solutions that
work with AWS Neuron.

Datadog
"""""""

Datadog, an observability and security platform, provides real-time monitoring for
cloud infrastructure and ML operations. Datadog's AWS Neuron integration pulls metrics
collected by the Neuron SDK's Neuron Monitor tool into Datadog, enabling users to
track the performance of their Trainium- and Inferentia-based instances. By providing
real-time visibility into model performance and hardware usage, Datadog helps
customers ensure efficient training and inference, optimized resource utilization, and
the prevention of service slowdowns.

`Datadog documentation `_

================================================
FILE: tools/tutorials/index.rst
================================================

.. _neuron-tools-tutorials:

Tutorials
=========

.. toctree::
    :hidden:
    :maxdepth: 1

    performance-profiling-vllm
    torch-neuronx-profiling-with-tb
    tutorial-tensorboard-scalars-mnist
    tutorial-neuron-monitor-mnist

.. grid:: 1 2 2 2
    :gutter: 3

    .. grid-item-card:: Profiling a vLLM Inference Workload
        :link: /tools/tutorials/performance-profiling-vllm
        :link-type: doc
        :class-card: sd-border-1

        Learn how to capture and analyze device-level and system-level profiles for
        vLLM inference workloads on AWS Trainium.

    .. grid-item-card:: Profiling a NKI Kernel
        :link: /nki/guides/use-neuron-profile
        :link-type: doc
        :class-card: sd-border-1

        Learn how to profile a NKI kernel with Neuron Explorer.

    .. grid-item-card:: Profiling PyTorch Neuron with TensorBoard
        :link: tutorial-tensorboard-scalars-mnist
        :link-type: doc
        :class-card: sd-border-1

        Learn how to use Neuron's plugin for TensorBoard that allows users to measure
        and visualize performance at the torch runtime level or the operator level.

    .. grid-item-card:: Track System Resource Utilization during Training with Neuron Monitor
        :link: tutorial-neuron-monitor-mnist
        :link-type: doc
        :class-card: sd-border-1

        Learn how to monitor resource utilization using neuron-monitor, Prometheus,
        and Grafana while running a multi-layer perceptron MNIST model on Trainium
        using PyTorch Neuron.

    .. grid-item-card:: Track Training Progress in TensorBoard using PyTorch Neuron
        :link: torch-neuronx-profiling-with-tb
        :link-type: doc
        :class-card: sd-border-1

        Learn how to track training progress in TensorBoard while running a
        multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.
================================================
FILE: tools/tutorials/performance-profiling-vllm.rst
================================================

.. meta::
    :description: Learn how to use Neuron Explorer to capture and analyze system-level and device-level profiles for vLLM inference workloads on AWS Trainium
    :date-modified: 12/02/2025

Profiling a vLLM Inference Workload on AWS Trainium
===================================================

This tutorial outlines the steps involved in using Neuron Explorer to capture and view
system-level and device-level profiles for a vLLM-hosted inference workload on AWS
Trainium.

Overview
--------

By following this tutorial you will learn how to:

* Launch a vLLM-hosted inference workload on AWS Trainium with system- and
  device-level profiling enabled
* View the system-level profile using Perfetto
* Identify regions within the system profile that show LLM context encoding (prefill)
  and token generation (decode) running on the NeuronDevices, along with the names of
  the associated compute graphs
* View the device-level profiles for the context-encoding and token-generation compute
  graphs in the Neuron Explorer UI

Prepare your environment
------------------------

The following steps show how to launch a Trainium EC2 instance using the latest Neuron
Deep Learning AMI (DLAMI) and then install vLLM so that an example vLLM-hosted model
can be profiled using Neuron Explorer. If you would prefer to use a containerized
environment (Docker, EKS), please refer to the Neuron documentation to get started
with a Neuron Deep Learning Container (DLC) image that has vLLM pre-installed.

1. Launch a Trainium instance (trn1.32xlarge, trn2.3xlarge, trn2.48xlarge).

   * Option 1: Launch the instance using the latest AWS Deep Learning AMI (DLAMI),
     which includes the Neuron SDK preinstalled. Once the instance is launched, SSH
     into it and activate the virtual environment for neuronx-distributed-inference
     with the following command:
     ``source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate``
   * Option 2: If using a fresh Linux instance, manually install the latest Neuron
     packages by following the AWS Neuron installation guide.

2. Install vLLM.

   * Refer to the Neuron documentation, which outlines how to install the Neuron vLLM
     fork from source.

Step 1: Save a smaller version of your model
--------------------------------------------

When profiling LLMs, it is usually desirable to use only a subset of the model's
layers in order to understand model performance and to identify possible bottlenecks.
Capturing traces for the entire model could lead to an excessive volume of profiling
data, making analysis cumbersome. To address this, the following script takes the
Qwen3-8B-Base model, truncates it to the first 4 layers, and saves the resulting
smaller model for profiling purposes.

.. code-block:: python

    import transformers

    model_id = "Qwen/Qwen3-8B-Base"

    config = transformers.AutoConfig.from_pretrained(model_id)
    config.num_hidden_layers = 4
    config.layer_types = ["full_attention"] * 4

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

    output_dir = "4layer_qwen3"
    model = transformers.AutoModelForCausalLM.from_pretrained(model_id, config=config)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

Save the above Python script as ``save_4layer_qwen.py`` and then run it using the
Python interpreter:
Step 2: Run a vLLM offline inference workload with profiling enabled
--------------------------------------------------------------------

In this step, you will run a small vLLM offline inference script that compiles, runs, and profiles your 4-layer Qwen3 model on the Trainium chips. Begin by saving the following Python script as ``qwen3_offline_inference.py``:

.. code-block:: python

   import os
   os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"

   # Enable Neuron profiling via environment variables
   os.environ['XLA_IR_DEBUG'] = "1"
   os.environ['XLA_HLO_DEBUG'] = "1"
   os.environ['NEURON_FRAMEWORK_DEBUG'] = "1"
   os.environ['NEURON_RT_INSPECT_ENABLE'] = "1"
   os.environ['NEURON_RT_INSPECT_SYSTEM_PROFILE'] = "1"
   os.environ['NEURON_RT_INSPECT_DEVICE_PROFILE'] = "1"
   os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'] = "./neuron_profiles"

   from vllm import LLM, SamplingParams

   # Sample prompts.
   prompts = [
       "The president of the United States is",
       "The capital of France is",
       "The future of AI is",
   ]

   # Create a sampling params object.
   sampling_params = SamplingParams(top_k=1)

   # Create an LLM instance using the 4-layer Qwen3 model
   llm = LLM(
       model="4layer_qwen3",
       max_num_seqs=4,
       max_model_len=128,
       additional_config={
           "override_neuron_config": {
               "enable_bucketing": False,
           },
       },
       enable_prefix_caching=False,
       tensor_parallel_size=8)

   # Run inference using the sample prompts
   outputs = llm.generate(prompts, sampling_params)

Next, run the offline inference script with a Python interpreter:

.. code-block:: bash

   python3 ./qwen3_offline_inference.py

After roughly 60 seconds the script should complete, and you will see a new ``neuron_profiles`` directory which contains both system-level and device-level profile traces for this example inference workload.
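Note that the profiling behavior above is controlled entirely by environment variables, so the same settings can be applied to other entry points without code changes. The following is a sketch of exporting them at the shell level before launching a workload; the variable names are the ones set in the script above, while the ``vllm serve`` invocation is just an illustrative placeholder for whatever command starts your workload:

.. code-block:: bash

   # Same profiling switches as in qwen3_offline_inference.py, set at the shell level
   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_SYSTEM_PROFILE=1
   export NEURON_RT_INSPECT_DEVICE_PROFILE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./neuron_profiles

   # Launch your workload as usual, e.g. an online serving process
   vllm serve 4layer_qwen3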
Step 3: Visualize the system profile for your model
---------------------------------------------------

.. note::

   System profiles are currently viewed using the open-source Perfetto tool. Viewing of system profiles will be natively supported by the Neuron Explorer UI in an upcoming release.

Run the following command to generate a Perfetto-compatible file from the system profile traces that you previously captured:

.. code-block:: bash

   neuron-explorer view -d ./neuron_profiles --ignore-device-profile \
       --output-format perfetto

The above command generates a file called ``system_profile.pftrace`` in your working directory. Copy the ``system_profile.pftrace`` file to your local machine and open the Perfetto UI in your local web browser. In the left-hand menu, choose "Open trace file" and select your ``system_profile.pftrace`` file to view the system profile. Expand the first row under Default Workspace and you will see a timeline view similar to the following:

.. image:: /tools/profiler/images/perf-profiling-1.png

The system profile shows a high-level chronological view of the various Neuron Runtime API calls that took place during your example inference workload. If you hover the mouse cursor over the various pink/green bars you can see which specific API call occurred at each time point, such as ``nrt_tensor_read``, ``nrt_tensor_write``, ``nrt_execute``, and ``nrt_load_collectives``.

Look for the **nrt_execute** bar identified below and select it. This will open an information dialog providing details of the specific ``nrt_execute`` call:

.. image:: /tools/profiler/images/perf-profiling-2.png

.. image:: /tools/profiler/images/perf-profiling-3.png

In the Arguments pane you will find useful information such as the following:

* device_profile - the unique name of the device profile associated with this event
* nc_idx - the index of the NeuronCore that is associated with this API call
* model_name - path to the compiled Neuron Executable File Format (NEFF) compute graph associated with this event

In the above screenshot, notice that the model_name field provides additional information about what is happening during this part of the model execution:

.. code-block:: text

   tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_6d1668c2294e2409dd72+ad9e832d.neff

* ``context_encoding_model`` - indicates that this graph handles context encoding (prefill) during vLLM inference (other model names will instead include ``token_generation_model`` to indicate the token-generation / decode phase of inference).
* ``tp0`` - indicates that this profile is associated with rank 0 of the tensor-parallel (TP) replica group
* ``bk0`` - indicates that this profile is associated with the first sequence bucket, as configured in the NeuronX Distributed Inference (NxDI) NeuronConfig.

Step 4: Visualize device profiles in Neuron Explorer
----------------------------------------------------

In this step, you will view a device profile for your model in the Neuron Explorer UI.

If you look inside the ``neuron_profiles`` directory that was created during Step 2, you will see many Neuron Executable File Format (NEFF) files and their associated Neuron Trace File Format (NTFF) files. For each pair of NEFF/NTFF files, the NEFF represents the Neuron-compiled compute graph for a portion of your model, and the NTFF represents the device-level profile trace for that specific compute graph.

While you are free to view any of the device-level profiles using the Neuron Explorer UI, it is often more useful to start from the system-level profile and identify a specific device-level profile of interest. Let's refer back to the ``nrt_execute`` region of the system-level profile that was covered in the previous section. Find and left-click this region to bring up the information dialog at the bottom of Perfetto:

.. image:: /tools/profiler/images/perf-profiling-4.png

.. image:: /tools/profiler/images/perf-profiling-5.png

In the device_profile field, note the numerical ID that is included at the end of the device profile name, in this case 2120860766. This ID is what you will use to locate the NEFF/NTFF pair associated with this specific ``nrt_execute`` API call.

Use the following find command (substituting in your device profile ID) to locate the NEFF/NTFF files associated with your identified ID:

.. code-block:: bash

   find ./neuron_profiles -name \*2120860766\* | sort

.. image:: /tools/profiler/images/perf-profiling-6.png

In the above output you can see that there is a single NEFF file ``neff_2120860766.neff`` and multiple NTFF files, ``2120860766_instid_0_vnc_0.ntff`` ... ``2120860766_instid_0_vnc_7.ntff``, each representing the profile trace for one of the 8 NeuronCores that participated in this inference request. These are the files you will open in the Neuron Explorer UI to inspect the device-level execution.

Copy the NEFF and one of the NTFF files to your local machine, as you will need to upload the files to the Neuron Explorer UI using your web browser.
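If you are working over SSH, one way to pull the files down is ``scp`` from your local machine; the following is a sketch, assuming an Ubuntu DLAMI instance, your own key and address, and that ``neuron_profiles`` sits under the remote home directory (adjust the paths to match the ``find`` output above):

.. code-block:: bash

   # Run from your local machine; substitute your key, user, and instance address
   scp -i ~/my-ec2.pem \
       "ubuntu@[PUBLIC_IP_ADDRESS]:~/neuron_profiles/*2120860766*" .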
To view the device profiles, execute the ``view`` command to start the Neuron Explorer web UI:

.. code-block:: bash

   $ neuron-explorer view --data-path ./ --output-format parquet

The path passed to ``--data-path`` is a directory that neuron-explorer will use for storing and managing profiles. The above command also prints a URL that you can click to open the web UI:

.. code-block:: text

   View a list of profiles at http://localhost:3001/

If ``neuron-explorer view`` is run on a remote instance, you may need to use port forwarding to access the web UI. By default, ``neuron-explorer`` creates a web server on port 3001 and an API server on port 3002. To enable connection from the browser on your local computer, you must establish an SSH tunnel to both ports 3001 and 3002. For example:

.. code-block:: bash

   ssh -L 3001:localhost:3001 -L 3002:localhost:3002 @ -fN

If you created an EC2 instance with PEM credentials, include them in the SSH tunnel as seen below:

.. code-block:: bash

   ssh -i ~/my-ec2.pem -L 3001:localhost:3001 -L 3002:localhost:3002 ubuntu@[PUBLIC_IP_ADDRESS] -fN

Once the SSH tunnel is set up, you can open a browser and navigate to http://localhost:3001.

With the Neuron Explorer UI open, go to "Profile Manager", and click "Upload Profile" at the top-right of the screen. Give your profile an appropriate name, and upload the NEFF and NTFF files that you previously identified:

.. image:: /tools/profiler/images/perf-profiling-7.png

After a few seconds, you should receive a message indicating that the NEFF/NTFF files were uploaded successfully:

.. image:: /tools/profiler/images/perf-profiling-8.png

Within the Neuron Explorer UI, go to the Profile Manager screen and look for your newly uploaded profile.

.. image:: /tools/profiler/images/perf-profiling-9.png

Depending on the size of your profile, it could take a few minutes before the Status field shows "PROCESSED". Once processing is complete, click the profile name to open the profile:

.. image:: /tools/profiler/images/perf-profiling-10.png

Confirmation
------------

Congratulations, you have now successfully generated both system-level and device-level profiles for a vLLM inference workload using Neuron Explorer and learned how to visualize them. This knowledge will enable you to effectively analyze the performance characteristics of your workload and identify potential optimization opportunities.

Clean up
--------

After completing your profiling experiments, remember to terminate the instance you launched to avoid unnecessary costs.

Next steps
----------

Now that you've completed this tutorial, try profiling your own model to analyze its workload. Identify performance gaps, apply optimizations, and profile again to measure the improvements. For a deeper dive into performance analysis, check out Neuron's blog series on profiling.


================================================
FILE: tools/tutorials/torch-neuronx-profiling-with-tb.rst
================================================
.. _torch-neuronx-profiling-with-tb:

Profiling PyTorch NeuronX with TensorBoard
==============================================================

.. contents:: Table of Contents
   :local:
   :depth: 2

Introduction
------------

Neuron provides a plugin for TensorBoard that allows users to measure and visualize performance at the torch runtime level or the operator level. With this information, it becomes quicker to identify performance bottlenecks and address them. For more information on the Neuron plugin for TensorBoard, see :ref:`neuronx-plugin-tensorboard`.

Setup
-----

Prerequisites
~~~~~~~~~~~~~
1. Initial `Trn1 setup for PyTorch (torch-neuronx) `__ has been done

Environment
~~~~~~~~~~~

::

   # activate the Python virtual environment and install tensorboard_plugin_neuronx
   source ~/aws_neuron_venv_pytorch_p38/bin/activate
   pip install tensorboard_plugin_neuronx

   # create a work directory for the Neuron profiling tutorials
   mkdir -p ~/neuron_profiling_tensorboard_examples
   cd ~/neuron_profiling_tensorboard_examples

Part 1: Operator Level Trace for ``xm.mark_step()`` workflow
-------------------------------------------------------------

Goal
~~~~

After completing this tutorial, you should be able to understand the features of the Operator Level Trace, and to form a narrative or surface-level analysis from what is being presented in it.

Set Up
~~~~~~

Let's set up a directory containing the material for this demo:

::

   cd ~/neuron_profiling_tensorboard_examples
   mkdir tutorial_1
   cd tutorial_1

   # this is where our code will be written
   touch run.py

Here is the code for ``run.py``:

::

   import os

   import torch
   import torch_neuronx
   from torch_neuronx.experimental import profiler
   import torch_xla.core.xla_model as xm

   os.environ["NEURON_CC_FLAGS"] = "--cache_dir=./compiler_cache"

   device = xm.xla_device()

   class NN(torch.nn.Module):
       def __init__(self):
           super().__init__()

           self.layer1 = torch.nn.Linear(4, 4)
           self.nl1 = torch.nn.ReLU()
           self.layer2 = torch.nn.Linear(4, 2)
           self.nl2 = torch.nn.Tanh()

       def forward(self, x):
           x = self.nl1(self.layer1(x))
           return self.nl2(self.layer2(x))

   with torch.no_grad():
       model = NN()
       inp = torch.rand(4, 4)

       output = model(inp)

       with torch_neuronx.experimental.profiler.profile(
               port=9012,
               profile_type='operator',
               ms_duration=10000):
           # IMPORTANT: the model has to be transferred to the XLA device within
           # the context manager, otherwise profiling won't work
           neuron_model = model.to(device)
           neuron_inp = inp.to(device)

           output_neuron = neuron_model(neuron_inp)
           xm.mark_step()

   print("==CPU OUTPUT==")
   print(output)
   print()
   print("==TRN1 OUTPUT==")
   print(output_neuron)

Understanding the Code
~~~~~~~~~~~~~~~~~~~~~~

For this first tutorial, we use a simple feed-forward NN model; even so, once the TensorBoard dashboard is up, we will see some interesting and unexpected things. A simple model is helpful since it is easy to reference back to. Another important part is the "operator" profiling type we specified in the context manager.

**Low Level:** The "operator" profile type produces the dashboard that contains the Operator Level Trace. This view zooms in on the NeuronDevice only, while the "trace" dashboard shows processes from all devices. The Operator Level Trace view is organized by levels of abstraction, with the top level showing the model class. The next lower tier shows model components, and the lowest tier shows the specific operators that occur for a specific model component. This view is useful for identifying model bottlenecks at the operator level.

We also print out the outputs from the CPU model and the Trn1 model to note the small differences in output.
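If you want a quantitative view of that difference rather than reading the tensors side by side, you could append a quick check to ``run.py``. This is a minimal sketch; it simply moves the XLA output back to the CPU and compares::

   # Quantify the CPU-vs-Trn1 numerical difference
   diff = (output - output_neuron.cpu()).abs().max()
   print(f"max abs difference: {diff.item():.6f}")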
Running The Profiler
~~~~~~~~~~~~~~~~~~~~

::

   python run.py

**Output:**

Initial Output & Compilation Success

::

   0%   10   20   30   40   50   60   70   80   90   100%
   |----|----|----|----|----|----|----|----|----|----|
   ***************************************************
   Analyzing dependencies of Block1
   0%   10   20   30   40   50   60   70   80   90   100%
   |----|----|----|----|----|----|----|----|----|----|
   ***************************************************
   Analyzing dependencies of Block1
   0%   10   20   30   40   50   60   70   80   90   100%
   |----|----|----|----|----|----|----|----|----|----|
   ***************************************************
   Dependency reduction of sg0000
   0%   10   20   30   40   50   60   70   80   90   100%
   |----|----|----|----|----|----|----|----|----|----|
   ***************************************************

Processing the Neuron Profiler Traces

::

   torch_neuron: Waiting for XLA profile completion ...
   torch_neuron: translate_xplane: Processing plane: '/host:CPU'
   torch_neuron: XLA decode - Read filename 2023_04_28_00_54_04
   torch_neuron: XLA decode - Read date parts ['2023', '04', '28', '00', '54', '04']
   torch_neuron: XLA decode - Read start date 2023-04-28 00:54:04 from directory stamp
   torch_neuron: translate_xplane: Processing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_op_timeline_split.json'
   torch_neuron: translate_xplane: Writing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_op_timeline_split.json' to 'temp_profiler_logs/c1a992f0ea378f7a_1/neuron_op_timeline_split.json'
   torch_neuron: translate_xplane: Processing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_op_timeline.json'
   torch_neuron: translate_xplane: Writing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_op_timeline.json' to 'temp_profiler_logs/c1a992f0ea378f7a_1/neuron_op_timeline.json'
   torch_neuron: translate_xplane: Processing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_hlo_op.json'
   torch_neuron: translate_xplane: Writing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_hlo_op.json' to 'temp_profiler_logs/c1a992f0ea378f7a_1/neuron_hlo_op.json'
   torch_neuron: translate_xplane: Processing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_framework_op.json'
   torch_neuron: translate_xplane: Writing plane: '/host:Neuron-runtime:profile//c1a992f0ea378f7a_1/model10001/node5/plugins/neuron/1682643254/neuron_framework_op.json' to 'temp_profiler_logs/c1a992f0ea378f7a_1/neuron_framework_op.json'

Printing output from the CPU model and the Trn1 model:

::

   ==CPU OUTPUT==
   tensor([[-0.1396, -0.3266],
           [-0.0327, -0.3105],
           [-0.0073, -0.3268],
           [-0.1683, -0.3230]])

   ==TRN1 OUTPUT==
   tensor([[-0.1396, -0.3266],
           [-0.0328, -0.3106],
           [-0.0067, -0.3270],
           [-0.1684, -0.3229]], device='xla:1')

Loading the Operator Level Trace in TensorBoard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run ``tensorboard --load_fast=false --logdir logs/``

Take note of the port (usually 6006) and enter ``localhost:`` into the local browser (assuming port forwarding is set up properly).

.. note::

   Check :ref:`Tensorboard Interface Overview` to understand the TensorBoard interface.
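If TensorBoard is running on the remote Trn1 instance, one way to set up that port forwarding is an SSH tunnel from your local machine; a sketch, substituting your own key, user, and instance address::

   ssh -i ~/my-ec2.pem -L 6006:localhost:6006 ubuntu@[PUBLIC_IP_ADDRESS] -fN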
Run names for the Operator Level Trace follow the same format, plus an id at the end: ``year_month_day_hour_minute_second_millisecond_id``. The Tool dropdown will have 3 options: operator-framework, operator-hlo, and operator-timeline.

Operator Framework View
~~~~~~~~~~~~~~~~~~~~~~~

|tensorboard-operator-framework-view|

This view contains a pie chart displaying the proportional execution time for each of the model operators at the framework level for a Neuron device. The list of operators is shown at the bottom, along with other details such as number of occurrences, execution time, and Neuron device and core.

Operator HLO View
~~~~~~~~~~~~~~~~~

|tensorboard-operator-hlo-view|

This view contains a pie chart displaying the proportional execution time for each of the model operators at the HLO level for a Neuron device. The list of operators is shown at the bottom, along with other details such as number of occurrences, execution time, and Neuron device and core.

.. note::

   For this simple model, the pie chart will be the same as the framework view. This won't be the case for larger and more complex models.

Operator Trace View
~~~~~~~~~~~~~~~~~~~

|tensorboard-operator-trace-view|

.. _trace_view_sections:

Trace View Sections
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Notice there are four sections: Process Overview, Control, Execution, and Data Transfer. In each section there are further subdivisions, with each layer representing a certain level of abstraction. Note also that the timescale axis is aligned between the sections. This matters because sometimes there are gaps in the process execution; most of the time, data transfer operations are happening in those gaps.

Fusion Operators
^^^^^^^^^^^^^^^^

**Simple Case:** Zooming in on the operations, we can recognize some operations for a neural network, such as a dot product and transpose, but sometimes there will be fused operators (fusion operators). To understand one of these operators, click on it, and some information will appear at the bottom of the dashboard.

|tensorboard-operator-trace-fusion-simple|

Notice in the above example that the fusion operator fuses the operators before and after itself on the timeline. More specifically, ``fused_3`` is a fusion of ``NN[model]/input`` and ``NN[model]/ReLU[nl1]/Tensor_1/aten__relu_maximum``. These kinds of fusions occur when the ``neuronx-cc`` compiler has found an optimization relating to the two operators. Most often this is the execution of the operators on separate compute engines or another form of parallelism.

**Complex Case:** The arrangement of fusion operators can get a little complicated, or contain "hidden" information. For the first example, let's zoom into the data transfer section such that we see the timescale range from 6000 ns to 6600 ns. It should look similar to the following:

|tensorboard-operator-trace-fusion-complex|

Looking at ``fused_16`` (11452 ns), we see it is surrounded by other fused operators. Furthermore, the ``fused_16`` operator fuses more than two operators: ``NN[model]/Linear[layer1]/aten__addmm_add``, ``NN[model]/input``, and ``NN[model]/Linear[layer1]/aten__addmm_dot``. These operators can be found in the timeline, but sometimes a fused operator's constituents may not appear in the timeline because they occur within another operation. We go over an example of this case in Part 2.
Understanding the Low Level Timeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Looking at the trace, we can look behind the scenes at how the model is executed on Neuron hardware. Before proceeding with the analysis, it is worth recalling the way we defined the model for this tutorial:

.. code:: python

   class NN(torch.nn.Module):
       def __init__(self):
           super().__init__()

           self.layer1 = torch.nn.Linear(4, 4)
           self.nl1 = torch.nn.ReLU()
           self.layer2 = torch.nn.Linear(4, 2)
           self.nl2 = torch.nn.Tanh()

       def forward(self, x):
           x = self.nl1(self.layer1(x))
           return self.nl2(self.layer2(x))

Analysis
^^^^^^^^^

**Input Operators:** We see input operators here because, in a mark-step flow, we need to transfer the inputs to the XLA device. This is represented by the ``SyncTensorsGraph.53`` call.

**ReLU at the beginning:** The first couple of blocks in the Process Data Transfer section initially appear to be confusing. There is an ``Input`` (0 ns) block followed by a ``ReLU`` (100 ns) operator. Under the hood, ``ReLU`` is rewritten as an ``elementwise_max(arr, 0)`` (0 here meaning an array of zeros), but to create this operation, the zeros have to be set in memory, which is a data operation. A general rule is that if an operator appears this early in the data transfer section, it most likely means there is an operation lowering that involves setting some values in memory for use later on.

**Memory allocation for Linear[layer1]:** We resume with the data transfer operations. Here, memory is getting allocated for specific operators, and sometimes the allocated inputs get loaded onto operators while the rest of the input gets allocated. This can be seen at ``fused_18`` (11811 ns) and ``fused_23`` (12181 ns). Eventually the input gets fully allocated, and other allocations occur for the dot product, transpose, and broadcast operators for ``Linear[layer1]`` and ``Linear[layer2]``.

Conclusion
^^^^^^^^^^^

There are a few conclusions that can be drawn from analyzing the timeline. We can see that we saved a bit of time due to parallelism with fusion operations, and some compute time with preloading operations (e.g., ``ReLU``). A clear trend is that a majority of the time is spent on data transfer operations. It is also evident that even a simple feed-forward NN becomes complicated when put under a microscope in the profiler. Facts such as the implementation of ``ReLU`` in the runtime/architecture aren't explicitly stated in the profiler, but make themselves known through the unusual ordering and placement of the trace blocks and the unusual fusion operators.

In terms of action items that can be taken based on our narrative, there really aren't any. This is a very simple model that finishes in about 8 microseconds, and we chose it because it is simple to understand. In more realistic examples, we will aim to do more compute than data transfer on the hardware, and, where possible, to overlap data transfer and compute between sequential operations. The profiler revealed a lot of optimizations that were done via fusion operators and parallelism. However, the end goal of this tool is to improve performance by revealing the bottlenecks of the model.

.. note::

   While we did explain some of the quirks visible in the profiler at a microscopic level, it isn't necessary to do so for normal use. This tutorial introduced the microscopic explanation for these occurrences to show the user that this is *indeed* what happens in the hardware when executing a simple FFNN.
Part 2: Operator Level Trace with ``torch_neuronx.trace()`` workflow
----------------------------------------------------------------------

Set Up
~~~~~~

The setup will be similar to Part 1.

::

   cd ~/neuron_profiling_tensorboard_examples
   mkdir tutorial_2
   cd tutorial_2

   # this is where our code will be written
   touch run.py

Here is the code for ``run.py``:

::

   import os
   import time

   import torch
   import torch_neuronx
   from torch_neuronx.experimental import profiler

   class NN(torch.nn.Module):
       def __init__(self):
           super().__init__()

           self.layer1 = torch.nn.Linear(4, 4)
           self.nl1 = torch.nn.ReLU()
           self.layer2 = torch.nn.Linear(4, 2)
           self.nl2 = torch.nn.Tanh()

       def forward(self, x):
           x = self.nl1(self.layer1(x))
           return self.nl2(self.layer2(x))

   model = NN()
   model.eval()

   inp = torch.rand(4, 4)

   output = model(inp)

   with torch_neuronx.experimental.profiler.profile(
           port=9012,
           profile_type='operator',
           ms_duration=10000,
           traced_only=True):
       neuron_model = torch_neuronx.trace(model, inp, compiler_workdir="./compiler_cache")
       output_neuron = neuron_model(inp)

   print("==CPU OUTPUT==")
   print(output)
   print()
   print("==INF2 OUTPUT==")
   print(output_neuron)

Important code differences from Part 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. ``import torch_xla.core.xla_model as xm`` is no longer necessary.
2. ``traced_only=True`` is set in ``torch_neuronx.experimental.profiler.profile()``. This option is necessary for traced models; otherwise the generated profile will be inaccurate or will not work.
3. The model is traced with ``torch_neuronx.trace()`` and ``xm.mark_step()`` is removed.

Otherwise, the code is the same as Part 1.

Running Part 2
~~~~~~~~~~~~~~~~~

To run:

::

   python run.py

The output will look almost identical to that of Part 1.

Loading the Operator Level Trace in TensorBoard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run ``tensorboard --load_fast=false --logdir logs/``, just like in Part 1.

.. note::

   Check :ref:`Tensorboard Interface Overview` to understand the TensorBoard interface.

Timeline View:

|tensorboard-operator-trace-view-traced|

Notable Differences in Timeline View from Part 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**No Input Operators:** For a traced model, we do not transfer the input to an XLA device, so these operations are not seen on the timeline. This also affects scheduling, which is why the profiled time is less than in the mark-step flow.

**Combined Loading of Linear[layer1] and Tanh:** ``fused_19`` (5824 ns) contains a fusion between ``Linear[layer1]`` and ``Tanh[nl2]``. This might seem a bit odd, but such data loading parallelism can be understood by looking at how tanh is implemented. Typically, functions like tanh are implemented via lookup tables that must be preloaded into memory, which is a data transfer operation. A bulk of the data transfer operations are done at the beginning to optimize computation.

.. note::

   Despite these differences, the big-picture conclusion drawn from Part 1 still holds, as the two timelines are more similar than different. One new insight is that the traced model performs better than the mark-step flow when profiling a single forward pass.
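If you want a rough end-to-end latency number alongside the profile, note that ``time`` is already imported in ``run.py`` above, so you could time an additional forward pass after the profiling context. This is a minimal sketch, not a substitute for the profiler's measurements; also note that the very first call after tracing can include one-time loading costs, so a later call gives a steadier number::

   start = time.perf_counter()
   _ = neuron_model(inp)
   elapsed_us = (time.perf_counter() - start) * 1e6
   print(f"traced forward pass took {elapsed_us:.0f} us")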
.. |tensorboard-url-image| image:: /images/Neuron_Profiler_Tensorboard_Url.jpg
.. |tensorboard-NEURON-header| image:: /images/Neuron_Profiler_Tensorboard_Header.jpg
.. |tensorboard-NEURON-dropdown| image:: /images/Neuron_Profiler_Tensorboard_Dropdown.jpg
.. |tensorboard-run-tool-dropdowns| image:: /images/Neuron_Profiler_Tensorboard_Run_Tool_Dropdowns.jpg
.. |tensorboard-run-trace-original| image:: /images/Neuron_Profiler_Runtime_Trace_Original.jpg
.. |tensorboard-run-trace-selected-section| image:: /images/Neuron_Profiler_Runtime_Trace_Section_Selection.jpg
.. |tensorboard-run-trace-selected-section-zoomed| image:: /images/Neuron_Profiler_Runtime_Trace_Section_Selection_Zoomed.jpg
.. |tensorboard-run-trace-selected-section-zoomed-named-traces| image:: /images/Neuron_Profiler_Runtime_Trace_Section_Selection_Zoomed_Named_Traces.jpg
.. |tensorboard-operator-framework-view| image:: /images/Neuron_Profiler_T1_Op_Framework_View.png
.. |tensorboard-operator-hlo-view| image:: /images/Neuron_Profiler_T1_Op_HLO_View.png
.. |tensorboard-operator-trace-view| image:: /images/Neuron_Profiler_T1_Op_Trace_View.png
.. |tensorboard-operator-trace-view-traced| image:: /images/Neuron_Profiler_T1_Op_Trace_View_Traced.png
.. |tensorboard-operator-trace-fusion-simple| image:: /images/Neuron_Profiler_T1_Op_Trace_Fusion_Simple.png
.. |tensorboard-operator-trace-fusion-complex| image:: /images/Neuron_Profiler_T1_Op_Trace_Fusion_Complex.png


================================================
FILE: tools/tutorials/tutorial-neuron-monitor-mnist.rst
================================================
.. _track-system-monitor:

Track System Resource Utilization during Training with neuron-monitor using PyTorch Neuron
==========================================================================================

.. contents:: Table of Contents
   :local:
   :depth: 2

This tutorial explains how to monitor resource utilization using **neuron-monitor**, **Prometheus**, and **Grafana** while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.

Multi-layer Perceptron MNIST Model
----------------------------------

This tutorial is based on the MNIST example for PyTorch Neuron on Trainium. For the full tutorial, please see :ref:`Multi-Layer Perceptron Training Tutorial `.

The Training Job
----------------

For this tutorial, we will make the original script do more work, thus giving us more system utilization data to observe. The training loop is simply repeated 1000 times:

.. code:: python

   for run in range(0, 1000):
       print(f'Run {run}')

       model.train()
       ...

Save the following code as :download:`train_monitor.py ` and run it as ``python3 train_monitor.py`` on a Trn1 instance.

.. literalinclude:: /src/examples/pytorch/mnist_mlp/train_monitor.py
   :language: python

Setting up **Prometheus** and **Grafana**
-----------------------------------------

.. note::

   The setup presented in the following paragraphs can be extended to monitor any number of instances running training jobs or inference workloads. For this tutorial, we will set everything up on a single Trn1 instance running Amazon Linux 2.

Setting up **Prometheus**
~~~~~~~~~~~~~~~~~~~~~~~~~

For a more detailed guide on how to install **Prometheus**, visit the official guide at https://prometheus.io/docs/prometheus/latest/getting_started/.

Download and unzip a prebuilt **Prometheus** binary on your Trn1 instance:

.. code:: bash

   wget https://github.com/prometheus/prometheus/releases/download/v2.38.0/prometheus-2.38.0.linux-amd64.tar.gz
   tar -xzvf prometheus-2.38.0.linux-amd64.tar.gz
   cd prometheus-2.38.0.linux-amd64/

Create a config and add a scrape target:

.. code:: bash

   vim prometheus.yml

.. code:: yaml

   scrape_configs:
     - job_name: 'neuron'
       # Scrape the target every 5 seconds.
       scrape_interval: 5s
       static_configs:
         - targets: ['localhost:8000']

Finally, start **Prometheus**:

.. code:: bash

   ./prometheus --config.file=prometheus.yml
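You can optionally confirm that Prometheus came up before moving on; a quick check, assuming the default Prometheus port of 9090:

.. code:: bash

   # Prometheus exposes a readiness endpoint on its default port
   curl -s localhost:9090/-/ready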
Setting up **Grafana**
~~~~~~~~~~~~~~~~~~~~~~

For a more detailed guide on how to install **Grafana**, visit the official guide at https://grafana.com/grafana/download.

Add the Grafana repo to dnf:

.. code:: bash

   sudo vim /etc/yum.repos.d/grafana.repo

   [grafana]
   name=grafana
   baseurl=https://packages.grafana.com/oss/rpm
   repo_gpgcheck=1
   enabled=1
   gpgcheck=1
   gpgkey=https://packages.grafana.com/gpg.key
   sslverify=1
   sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Install and start **Grafana**:

.. code:: bash

   sudo dnf install -y grafana
   sudo /bin/systemctl start grafana-server.service

By default, **Grafana** will run an HTTP server on port 3000. If you need to change that, update its config and restart the service:

.. code:: bash

   sudo vim /etc/grafana/grafana.ini
   ...
   sudo /bin/systemctl restart grafana-server.service

Using your favorite web browser, access the Grafana webpage and add a new dashboard. The default user and password are both 'admin':

.. image:: tutorial_grafana_login.png
   :alt: Image: image.png

Next, you'll add a Prometheus data source by going to ``Configuration`` -> ``Data Sources``:

.. image:: tutorial_grafana_data_sources.png
   :alt: Image: image.png

... and adding the local **Prometheus** server as a data source:

.. image:: tutorial_grafana_add_prometheus.png
   :alt: Image: image.png

Finally, upload the sample dashboard :download:`neuron-monitor-grafana.json ` to **Grafana**:

.. image:: tutorial_grafana_upload_dash.png
   :alt: Image: image.png

Monitoring the Training Workload
--------------------------------

Start the training job, which, due to the artificially added complexity, will take more than 15 minutes:

.. code:: bash

   python train_monitor.py

On the same instance, start ``neuron-monitor`` and its companion script, ``neuron-monitor-prometheus.py``:

.. code:: bash

   neuron-monitor | neuron-monitor-prometheus.py
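Before opening Grafana, you can optionally verify that metrics are being published; this assumes the companion script serves on port 8000, matching the ``localhost:8000`` scrape target configured in ``prometheus.yml`` earlier:

.. code:: bash

   curl -s localhost:8000 | head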
Once they are running, you can use your web browser to access the **Grafana** server running on your Trn1 instance and view a timeline of the system utilization.

The upper part of the dashboard contains:

- a list of the currently monitored instances (for this tutorial there is a single Trn1 instance)
- aggregated metrics for stats such as NeuronCore utilization, NeuronCores in use, iteration success rates, error rates, etc.
- a timeline of execution status rates and execution latencies

.. image:: tutorial_grafana_dash_1.png
   :alt: Image: image.png

The lower part of the dashboard contains:

- one line of charts containing a timeline of Neuron resource utilization (NeuronCore, vCPU, and memory utilization)
- one line of charts containing a timeline of host resource utilization (vCPU and memory utilization)

.. image:: tutorial_grafana_dash_2.png
   :alt: Image: image.png


================================================
FILE: tools/tutorials/tutorial-tensorboard-scalars-mnist.rst
================================================
.. _tb_track_training_minst:

Track Training Progress in TensorBoard using PyTorch Neuron
============================================================

.. contents:: Table of Contents
   :local:
   :depth: 2

This tutorial explains how to track training progress in TensorBoard while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.

Multi-layer perceptron MNIST model
----------------------------------

This tutorial is based on the MNIST example for PyTorch Neuron on Trainium. For the full tutorial, please see :ref:`Multi-Layer Perceptron Training Tutorial `.

Output TensorBoard logs
-----------------------

To generate TensorBoard logs, we first modify the training script to use the ``SummaryWriter``:

.. code:: python

   from torch.utils.tensorboard import SummaryWriter
   writer = SummaryWriter('./output')

In the training loop, we can then use the ``add_scalar`` API to log the loss per step.

.. code:: python

   writer.add_scalar("step loss", loss, idx)

At the end of the script, add ``writer.flush()`` to ensure all logs are written.

Save the following code as :download:`train_tb.py ` and run it as ``python3 train_tb.py`` on a Trn1 instance. The generated logs can be found in the ``./output`` directory that was passed to ``SummaryWriter``.

.. literalinclude:: /src/examples/pytorch/mnist_mlp/train_tb.py
   :language: python

View loss in TensorBoard
------------------------

In order to view your training metrics, install TensorBoard in your Python environment:

.. code:: bash

   pip install tensorboard

Then, launch TensorBoard with the ``./output`` directory:

.. code:: bash

   tensorboard --logdir ./output

Once running, open a new SSH connection to the instance and port-forward TCP port 6006 (e.g., ``-L 6006:127.0.0.1:6006``). Once the tunnel is established, TensorBoard can then be accessed via web browser at the following URL: `http://localhost:6006 `__. Please note that you will not be able to access TensorBoard if you disconnect your port-forwarding SSH session to the Trainium instance.

.. image:: tb-scalars.png
   :alt: Image: image.png

In TensorBoard, you can now see the loss per step plotted. When capturing loss for multiple runs, you can plot them together on the same graph to compare runs. Be sure to change the output directory for different runs, for example ``./output/run1`` for the first, ``./output/run2`` for the second, etc.
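One simple way to do that is to parameterize the log directory when creating the writer. The following is a minimal sketch, where ``run_name`` is any label you choose (it is not part of the original script):

.. code:: python

   import os
   from torch.utils.tensorboard import SummaryWriter

   run_name = "run1"  # change this per run, e.g. via a CLI argument
   writer = SummaryWriter(os.path.join("./output", run_name))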